I’ve attached some more detailed stack traces as well.
Here’s what my replication agreements look like:
[root@sso-107 (NY) ~]$ ipa-replica-manage list
sso-108.nym1.placeiq.net: master
sso-110.nym1.placeiq.net: master
sso-107.nym1.placeiq.net: master
sso-109.nym1.placeiq.net: master
[root@sso-107 (NY) ~]$ ipa-replica-manage list sso-107.nym1.placeiq.net
sso-108.nym1.placeiq.net: replica
sso-110.nym1.placeiq.net: replica
[root@sso-107 (NY) ~]$ ipa-replica-manage list sso-108.nym1.placeiq.net
sso-107.nym1.placeiq.net: replica
sso-109.nym1.placeiq.net: replica
[root@sso-107 (NY) ~]$ ipa-replica-manage list sso-109.nym1.placeiq.net
sso-108.nym1.placeiq.net: replica
sso-110.nym1.placeiq.net: replica
[root@sso-107 (NY) ~]$ ipa-replica-manage list sso-110.nym1.placeiq.net
sso-107.nym1.placeiq.net: replica
sso-109.nym1.placeiq.net: replica
SSO-107
top - 15:58:08 up 2 days, 10:00, 1 user, load average: 0.00, 0.03, 0.06
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
Cpu(s): 12.2%us, 1.1%sy, 0.0%ni, 86.7%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2952788k total, 2160216k used, 792572k free, 182584k buffers
Swap: 4094972k total, 0k used, 4094972k free, 678292k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11615 dirsrv 20 0 2063m 843m 19m S 25.5 29.3 403:53.56 ns-slapd
[root@sso-107 (NY) /var/log/dirsrv/slapd-PLACEIQ-NET]$ ls -al /proc/`cat /var/run/dirsrv/slapd-PLACEIQ-NET.pid`/fd|grep socket|wc -l
245
SSO-108
top - 15:57:26 up 3 days, 17:25, 1 user, load average: 0.03, 0.03, 0.00
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.3%us, 0.2%sy, 0.0%ni, 99.4%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2952788k total, 2200792k used, 751996k free, 182084k buffers
Swap: 4094972k total, 0k used, 4094972k free, 713848k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24399 dirsrv 20 0 2055m 819m 19m S 0.8 28.4 54:48.53 ns-slapd
[root@sso-108 (NY) /var/log/dirsrv/slapd-PLACEIQ-NET]$ ls -al /proc/`cat /var/run/dirsrv/slapd-PLACEIQ-NET.pid`/fd|grep socket|wc -l
232
SSO-109
top - 16:00:05 up 4 days, 9:10, 1 user, load average: 0.06, 0.32, 0.35
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.7%us, 0.3%sy, 0.0%ni, 98.9%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2952788k total, 2422572k used, 530216k free, 235472k buffers
Swap: 4094972k total, 0k used, 4094972k free, 906080k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22522 dirsrv 20 0 2065m 772m 19m S 1.2 26.8 308:13.07 ns-slapd
[root@sso-109 (NY) ~]$ ls -al /proc/`cat /var/run/dirsrv/slapd-PLACEIQ-NET.pid`/fd|grep socket|wc -l
219
SSO-110
top - 16:07:54 up 14 days, 18:03, 1 user, load average: 0.00, 0.01, 0.00
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.0%us, 1.0%sy, 0.0%ni, 96.7%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2952788k total, 2304556k used, 648232k free, 155216k buffers
Swap: 4094972k total, 64k used, 4094908k free, 748972k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2401 dirsrv 20 0 2074m 839m 18m S 4.8 29.1 48:25.58 ns-slapd
[root@sso-110 (NY) /var/log/dirsrv/slapd-PLACEIQ-NET]$ ls -al /proc/`cat /var/run/dirsrv/slapd-PLACEIQ-NET.pid`/fd|grep socket|wc -l
257
Jim Richard | PlaceIQ | Systems Administrator | jrich...@placeiq.com | +1 (646) 338-8905
On Feb 19, 2015, at 9:33 AM, Rich Megginson <rmegg...@redhat.com> wrote:
On 02/18/2015 11:05 PM, Jatin Nansi wrote:
Check the ns-slapd access and error logs of the DS instance hosting
the IPA instance. The strace output indicates that ns-slapd spent
most of its time waiting for network activity to happen (poll),
which is normal for ns-slapd.
The number of open connections correlates to the CPU usage. Do this:
# ls -al /proc/`cat /var/run/dirsrv/slapd-MY-DOMAIN.pid`/fd|grep socket|wc -l
How many socket connections do you have?
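For watching that count over time rather than taking a single reading, here is a minimal sketch along the same lines (the dirsrv pid-file path is the one from this thread; it defaults to the current shell's PID only so the script runs anywhere):

```shell
#!/bin/sh
# Minimal sketch: count socket fds for a process via /proc/<pid>/fd.
# Defaults to the current shell's PID so it is runnable as-is; on a
# dirsrv host you would pass the slapd pid instead, e.g.:
#   ./sockcount.sh "$(cat /var/run/dirsrv/slapd-MY-DOMAIN.pid)"
pid="${1:-$$}"
ls -l "/proc/$pid/fd" 2>/dev/null | grep -c socket
```

Sampling this in a loop (say, once a minute with a timestamp) makes it easy to correlate the socket count with the CPU spikes you see in Zabbix.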
Also, it will be very useful to get some stack traces of the running
server to see what the various threads are doing. See
http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-hangs
Jatin
On 19/02/15 15:52, Jim Richard wrote:
I’ve got 4 Red Hat IdM masters in a multi-master config. The IPA version is 3.0.0-42.el6.centos, 389-ds-base is 1.2.11.15-48.el6_6, on CentOS 6.6.
Monitoring established connections on port 389 and dsInOps over
time shows a consistent, even level of activity; however, 2 of the 4
IPA servers show ever-increasing CPU usage by ns-slapd. One
ns-slapd process will show increased CPU for a time, then
drop off as another increases. This cycle continues, with each
switch seeing more and more total CPU usage by ns-slapd.
strace timing for the offending ns-slapd looks like the following:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
96.12 9.342272 1133 8243 poll
3.86 0.375457 53 7066 41 futex
0.01 0.000668 0 8244 8244 getpeername
0.00 0.000374 0 929 close
0.00 0.000368 0 3201 read
0.00 0.000151 0 882 setsockopt
0.00 0.000095 2 42 access
0.00 0.000033 0 1365 fcntl
0.00 0.000000 0 42 open
0.00 0.000000 0 39 stat
0.00 0.000000 0 42 fstat
0.00 0.000000 0 1 madvise
0.00 0.000000 0 441 accept
0.00 0.000000 0 441 getsockname
0.00 0.000000 0 1 restart_syscall
------ ----------- ----------- --------- --------- ----------------
100.00 9.719418 30979 8285 total
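One detail worth flagging in a summary like this (my reading, not something established in the thread): getpeername shows 8244 calls and 8244 errors, i.e. every single call failed. A tiny awk sketch can pull such syscalls out of `strace -c` output; the canned rows below mirror the table above, while on a live system you would pipe in the real summary:

```shell
#!/bin/sh
# Sketch: from `strace -c` summary rows, print syscalls where every
# call failed (errors column == calls column). Rows without an errors
# value have only 5 fields and are skipped.
all_failed() {
    awk 'NF == 6 && $5 ~ /^[0-9]+$/ && $4 == $5 { print $6 }'
}

# Canned rows in strace -c column order:
# %time  seconds  usecs/call  calls  errors  syscall
printf '%s\n' \
  ' 96.12   9.342272    1133   8243         poll' \
  '  3.86   0.375457      53   7066      41 futex' \
  '  0.01   0.000668       0   8244    8244 getpeername' | all_failed
# prints: getpeername
```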
I have carefully reviewed cn=config settings on all four master
servers to confirm that they match.
Based on this strace output, can you perhaps point me in the right
direction and give me a clue about what I should be looking at?
Here’s a screen shot of my Zabbix reporting to help describe the
problem. Note the graph in the bottom right corner.
The problem is most certainly related to replication but I just
don’t know what specifically to look at.
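One thing that may be worth ruling out here (an assumption on my part, not a confirmed diagnosis) is half-closed client connections: getpeername failing on every call in the strace summary can happen when sockets sit in CLOSE_WAIT and keep waking the poll loop. A quick sketch for counting them from netstat-style output, demonstrated on canned lines so it is self-contained; on a live host you would feed it `netstat -ant`:

```shell
#!/bin/sh
# Sketch: count connections in CLOSE_WAIT from netstat-style lines,
# where the connection state is the last field on each row.
count_close_wait() {
    awk '$NF == "CLOSE_WAIT"' | wc -l
}

printf '%s\n' \
  'tcp 0 0 10.0.0.1:389 10.0.0.2:51000 ESTABLISHED' \
  'tcp 0 0 10.0.0.1:389 10.0.0.3:51001 CLOSE_WAIT' \
  'tcp 0 0 10.0.0.1:389 10.0.0.4:51002 CLOSE_WAIT' | count_close_wait
# prints: 2
```

If the CLOSE_WAIT count climbs on the busy masters but not the idle ones, that would point at clients (or replication peers) dropping connections that the server never reaps.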
<Mail Attachment.png>
Thanks in advance for any clues you can provide.
Jim Richard | PlaceIQ | Systems Administrator | jrich...@placeiq.com | +1 (646) 338-8905
--
Manage your subscription for the Freeipa-users mailing list:
https://www.redhat.com/mailman/listinfo/freeipa-users
Go To http://freeipa.org for more info on the project