I'm not sure it's related, but I encountered a similar issue a few months ago. In my case, an "at" command was sending a kill signal to the at daemon with its correct PID. Somehow, once in a while, this signal reached the Cassandra process (also Java) and killed it. After some time investigating, I assumed this had to be a kernel bug or something similar, and I opened a ticket for CentOS - http://bugs.centos.org/view.php?id=7539 - which nobody is really looking at :) You can read there how I tried to tackle it. Bottom line: we changed the at scheduler to a different implementation and we don't get this issue any more.
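Before suspecting the kernel, it may be worth ruling out the more common explanation: a stale PID file or PID reuse, where the number being signalled no longer belongs to the intended daemon. Below is a minimal sketch of such a guard, assuming Linux with /proc mounted. The helper name `safe_kill` is hypothetical and not part of any init script; it would not have helped in the case above, where the PID was correct, and a small race remains between the check and the kill.

```shell
#!/bin/bash
# Hypothetical helper (not from the original thread): send a signal to a PID
# only if /proc/<pid>/comm still matches the expected command name.
# Assumes Linux with /proc mounted; a small race remains between check and kill.
safe_kill() {
    local pid="$1" expected="$2" sig="${3:-TERM}"
    local actual

    # /proc/<pid>/comm holds the command name of the running process.
    if ! actual=$(cat "/proc/${pid}/comm" 2>/dev/null); then
        echo "safe_kill: no process with pid ${pid}" >&2
        return 1
    fi

    # Refuse to signal if the PID now belongs to a different program.
    if [ "$actual" != "$expected" ]; then
        echo "safe_kill: pid ${pid} is '${actual}', expected '${expected}'" >&2
        return 2
    fi

    kill -s "$sig" "$pid"
}
```

For example, `safe_kill "$pid" ntpd TERM` returns 2 without sending anything if `$pid` currently belongs to a java process. Where `$pid` comes from (a PID file, `pgrep`, etc.) is left to the caller.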
HTH,
Or.

On Wed, Feb 11, 2015 at 3:39 AM, David chen <[email protected]> wrote:

> The command 'service ntpd stop' could be triggered around 14:00.
> Because the crontab was set as follows:
> 0 * * * * sh sync.sh
> The script contains the following command:
> #!/bin/bash
> service ntpd stop
> ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
> service ntpd start
> chkconfig ntpd on
>
> Found the following fragment in /var/log/message:
> Jan 7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
> Jan 7 13:59:59 host1 ntpd[44764]: ntpd [email protected] Fri Feb 22
> 11:23:27 UTC 2013 (1)
> Jan 7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
> Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard,
> 0.0.0.0#123 Disabled
> Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard,
> ::#123 Disabled
> Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123
> Enabled
> Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2,
> fe80::ca1f:66ff:fee1:eed#123 Enabled
> Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo,
> 127.0.0.1#123 Enabled
> Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2,
> 192.168.1.151#123 Enabled
> Jan 7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22
> for interface updates
> Jan 7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
> Jan 7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from
> /var/lib/ntp/drift
> Jan 7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished,
> exiting
> Jan 7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
> Jan 7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
> Jan 7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...
> shutting down...
> Jan 7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing
> attributes.
> Jan 7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
> Jan 7 14:52:48 host1 ntpd[44765]: no servers reachable
>
> It looks likely the command 'service ntpd stop' send the SIGTERM signal.
> The above clue 'ntpd[32101]' indicates that the ntpd process PID is 32101,
> inspect NameNode log, i found that the NameNode process PID was not
> identical with ntpd.
> So i wonder why Namenode process can received the signal?
>

--
Or Sher
