A shell script is deployed on every node of the HDFS cluster. It is invoked
hourly by cron, and its content is as follows:
#!/bin/bash
service ntpd stop
ntpdate 192.168.0.1 # a valid NTP server on the LAN
service ntpd start
chkconfig ntpd on


After several days, the NameNode crashed suddenly. Its log showed no errors
other than the following:
2015-01-07 14:00:00,709 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: 
RECEIVED SIGNAL 15: SIGTERM


Inspecting the Linux system log (CentOS, /var/log/messages), I also found the
following clues:
Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
Jan  7 13:59:59 host1 ntpd[44764]: ntpd [email protected] Fri Feb 22 11:23:27 
UTC 2013 (1)
Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard, 
0.0.0.0#123 Disabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard, ::#123 
Disabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2, 
fe80::ca1f:66ff:fee1:eed#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo, 127.0.0.1#123 
Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2, 
192.168.1.151#123 Enabled
Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22 for 
interface updates
Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from 
/var/lib/ntp/drift
Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished, 
exiting
Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...  shutting 
down...
Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing 
attributes.
Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable


It looks likely that the NameNode received the SIGTERM sent by the command that
stops ntpd.
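
To check this guess, something like the sketch below could be run just before
"service ntpd stop". It rests on assumptions I have not verified on these
nodes: that the init script stops ntpd by signalling the PID recorded in a pid
file, and that the file is /var/run/ntpd.pid. The helper name pid_matches_name
is made up for illustration. The sketch only reports whether the recorded PID
currently belongs to a process named ntpd; it does not stop or start anything.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only if the PID stored in a pid file
# belongs to a live process whose command name equals the expected one.
pid_matches_name() {
    local pidfile=$1 expected=$2 pid="" comm=""
    [ -r "$pidfile" ] || return 1
    read -r pid < "$pidfile"
    # Reject empty or non-numeric content.
    case $pid in
        ''|*[!0-9]*) return 1 ;;
    esac
    if [ -r "/proc/$pid/comm" ]; then
        # Kernel-reported command name of that PID.
        read -r comm < "/proc/$pid/comm"
    else
        # Older kernels may lack /proc/PID/comm, so fall back to ps.
        comm=$(ps -p "$pid" -o comm= 2>/dev/null) || return 1
    fi
    [ "$comm" = "$expected" ]
}

# Assumed pid file location; adjust to the -p option ntpd is started with.
PIDFILE=${PIDFILE:-/var/run/ntpd.pid}

if [ -e "$PIDFILE" ] && ! pid_matches_name "$PIDFILE" ntpd; then
    echo "warning: PID in $PIDFILE does not belong to ntpd" >&2
fi
```

If the warning ever appeared shortly before a crash, it would support the idea
that the stop command signals a PID that no longer belongs to ntpd. If it never
appears, the guess above is probably wrong.
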
So far the problem has happened three times, at Jan  7 14:00:00, Jan 14
14:00:00 and Feb  4 14:00:00.
I know the time-synchronization script is not quite proper, and I know the
correct way to synchronize. But I wonder why the NameNode would receive the
SIGTERM sent when ntpd is stopped, and why all three occurrences happened at
14:00:00.
Any ideas would be appreciated.
