A shell script is deployed on every node of an HDFS cluster and is invoked hourly by crontab. Its content is as follows:

    #!/bin/bash
    service ntpd stop
    ntpdate 192.168.0.1   # it's a valid ntpd server in LAN
    service ntpd start
    chkconfig ntpd on

After several days, the NameNode crashed suddenly, but its log showed no errors other than the following:

    2015-01-07 14:00:00,709 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM

I inspected the Linux system log (CentOS, /var/log/messages) and found the following clues:

    Jan 7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
    Jan 7 13:59:59 host1 ntpd[44764]: ntpd [email protected] Fri Feb 22 11:23:27 UTC 2013 (1)
    Jan 7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
    Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard, 0.0.0.0#123 Disabled
    Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard, ::#123 Disabled
    Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123 Enabled
    Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2, fe80::ca1f:66ff:fee1:eed#123 Enabled
    Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo, 127.0.0.1#123 Enabled
    Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2, 192.168.1.151#123 Enabled
    Jan 7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22 for interface updates
    Jan 7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
    Jan 7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from /var/lib/ntp/drift
    Jan 7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished, exiting
    Jan 7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
    Jan 7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
    Jan 7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal... shutting down...
    Jan 7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing attributes.
    Jan 7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
    Jan 7 14:52:48 host1 ntpd[44765]: no servers reachable

It looks likely that the NameNode received the SIGTERM signal that was sent when the script stopped ntpd.

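One way to test this guess would be to record, right before each stop, which process the ntpd PID file actually points at. The following is only a minimal sketch under assumptions that are not confirmed anywhere above: that the init script stops ntpd through a PID file, that the file lives at /var/run/ntpd.pid, and that /proc is available. The function name `describe_pidfile` and the log tag `ntp-sync` are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print "PID COMMAND-LINE" for the process named in a PID file.
# Assumes a Linux /proc filesystem. Returns non-zero when the file is missing,
# the content is not a number, or no process with that PID exists.
describe_pidfile() {
    local pidfile="$1" pid cmd
    [ -r "$pidfile" ] || { echo "no-pidfile"; return 1; }
    read -r pid < "$pidfile"
    case "$pid" in
        ''|*[!0-9]*) echo "bad-pid"; return 1 ;;
    esac
    if [ -r "/proc/$pid/cmdline" ]; then
        # cmdline is NUL-separated; turn the separators into spaces for logging
        cmd=$(tr '\0' ' ' < "/proc/$pid/cmdline")
        echo "$pid $cmd"
    else
        echo "$pid not-running"
        return 1
    fi
}

# Possible use in the cron script, just before "service ntpd stop"
# (the PID file path is an assumption, not taken from the logs above):
# logger -t ntp-sync "ntpd pidfile points at: $(describe_pidfile /var/run/ntpd.pid)"
```

If the logged command line ever turns out to be the NameNode's Java process instead of ntpd, that would support the idea that the stop command signalled the wrong PID. If it always shows ntpd, the cause lies elsewhere.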
So far the problem has happened three times: at Jan 7 14:00:00, Jan 14 14:00:00 and Feb 4 14:00:00. I know the script is not a proper way to synchronize time, and I also know the correct way. But I wonder why the NameNode can receive the SIGTERM signal sent by stopping ntpd, and why all three incidents happened at exactly 14:00:00. Any ideas would be appreciated.
