I would spot on Jan 7 14:52:48 host1 ntpd[44765]: no servers reachable
looks for me like an network / DNS issue. You can check per dmesg whats going on, too. BR - Alexander > On 09 Feb 2015, at 17:57, daemeon reiydelle <[email protected]> wrote: > > Absolutely a critical error to lose the configured ntpd time source in > Hadoop. The replication and many other services require absolutely > millisecond time sync between the nodes. Interesting that your SRE design > called for ntpd running on each node. Curious. > > What is the problem you are trying to solve by stopping ntpd on the local > host? Did someone not understand how ntpd works? Did someone configure it to > (I sure hope not) be free running? > > > > ....... > “Life should not be a journey to the grave with the intention of arriving > safely in a > pretty and well preserved body, but rather to skid in broadside in a cloud of > smoke, > thoroughly used up, totally worn out, and loudly proclaiming “Wow! What a > Ride!” > - Hunter Thompson > > Daemeon C.M. Reiydelle > USA (+1) 415.501.0198 > London (+44) (0) 20 8144 9872 > > On Sun, Feb 8, 2015 at 7:30 PM, David chen <[email protected] > <mailto:[email protected]>> wrote: > A shell script is deployed on every node of HDFS cluster, the script is > invoked hourly by crontab, and its content is as follows: > #!/bin/bash > service ntpd stop > ntpdate 192.168.0.1 #it's a valid ntpd server in LAN > service ntpd start > chkconfig ntpd on > > After several days, NameNode crashed suddenly, but its log seemed no other > errors except the following: > 2015-01-07 14:00:00,709 ERROR > org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM > > Inspected the Linux log(Centos /var/log/messages), also found the following > clues: > Jan 7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15 > Jan 7 13:59:59 host1 ntpd[44764]: ntpd [email protected] Fri Feb 22 11:23:27 > UTC 2013 (1) > Jan 7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec > Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard, > 0.0.0.0#123 Disabled > Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard, ::#123 > Disabled > Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123 > Enabled > Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2, > fe80::ca1f:66ff:fee1:eed#123 Enabled > Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo, > 127.0.0.1#123 Enabled > Jan 7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2, > 192.168.1.151#123 Enabled > Jan 7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22 for > interface updates > Jan 7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040 > Jan 7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from > /var/lib/ntp/drift > Jan 7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished, > exiting > Jan 7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2 > Jan 7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001 > Jan 7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal... shutting > down... > Jan 7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing > attributes. > Jan 7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5 > Jan 7 14:52:48 host1 ntpd[44765]: no servers reachable > > It looks likely that NameNode received the SIGTERM signal sent by stopping > ntpd command. > Up to now, the problem has happened three times repeatedly, the time point > was Jan 7 14:00:00, Jan 14 14:00:00 and Feb 4 14:00:00 respectively. > Although the script to synchronize time is a little improper, and i also know > the correct synchronized way. but i wonder why NameNode can receive the > SIGTERM signal sent by stopping ntpd command? and why three times all > happened at 14:00:00? > Any ideas can be appreciated. >
