Hi all,

I'm running a Hadoop cluster with 24 servers. It had been running for some
months, but since the last reboot the DataNodes keep dying with this error:


2016-02-05 11:35:56,615 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/192.168.0.133:50010, dest: /192.168.0.133:40786, bytes: 118143861, op: 
HDFS_READ, cliID: 
DFSClient_attempt_1454667838939_0001_m_000330_0_-1595784897_1, offset: 0, 
srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: 
BP-2025286576-192.168.0.93-1414492170010:blk_1076219758_2486790, duration: 
21719288540
2016-02-05 11:35:56,755 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/192.168.0.133:50010, dest: /192.168.0.133:40784, bytes: 118297616, op: 
HDFS_READ, cliID: 
DFSClient_attempt_1454667838939_0001_m_000231_0_-1089799971_1, offset: 0, 
srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: 
BP-2025286576-192.168.0.93-1414492170010:blk_1076221376_2488408, duration: 
22149605332
2016-02-05 11:35:56,837 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/192.168.0.133:50010, dest: /192.168.0.133:40780, bytes: 118345914, op: 
HDFS_READ, cliID: 
DFSClient_attempt_1454667838939_0001_m_000208_0_-2005378882_1, offset: 0, 
srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: 
BP-2025286576-192.168.0.93-1414492170010:blk_1076231364_2498422, duration: 
22460210591
2016-02-05 11:35:57,359 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/192.168.0.133:50010, dest: /192.168.0.133:40781, bytes: 118419792, op: 
HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000184_0_406014429_1, 
offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: 
BP-2025286576-192.168.0.93-1414492170010:blk_1076221071_2488103, duration: 
22978732747
2016-02-05 11:35:58,008 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/192.168.0.133:50010, dest: /192.168.0.133:40787, bytes: 118151696, op: 
HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000324_0_-608122320_1, 
offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: 
BP-2025286576-192.168.0.93-1414492170010:blk_1076222362_2489394, duration: 
23063230631
2016-02-05 11:36:00,295 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/192.168.0.133:50010, dest: /192.168.0.133:40776, bytes: 123206293, op: 
HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000015_0_-846180274_1, 
offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: 
BP-2025286576-192.168.0.93-1414492170010:blk_1076244668_2511731, duration: 
26044953281
2016-02-05 11:36:00,407 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/192.168.0.133:50010, dest: /192.168.0.133:40764, bytes: 123310419, op: 
HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000010_0_-310980548_1, 
offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: 
BP-2025286576-192.168.0.93-1414492170010:blk_1076244751_2511814, duration: 
26288883806
2016-02-05 11:36:01,371 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/192.168.0.133:50010, dest: /192.168.0.133:40783, bytes: 119653309, op: 
HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000055_0_-558109635_1, 
offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: 
BP-2025286576-192.168.0.93-1414492170010:blk_1076222182_2489214, duration: 
26808381782
2016-02-05 11:36:05,224 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
RECEIVED SIGNAL 15: SIGTERM
2016-02-05 11:36:05,230 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at computer75/192.168.0.133
************************************************************/
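One thing I notice in those lines, in case it helps: if I read the `duration` field as nanoseconds, each of those ~118 MB reads took over 20 seconds, and the durations grow from one line to the next. A quick sanity check of that conversion, with values copied from the log above:

```shell
# Convert the clienttrace "duration" values (nanoseconds, as I understand
# the field) into seconds and MB/s for the matching "bytes" counts.
printf '%s %s\n' \
  118143861 21719288540 \
  118297616 22149605332 \
  119653309 26808381782 |
awk '{ secs = $2 / 1e9; printf "%d bytes in %.1f s (%.1f MB/s)\n", $1, secs, $1 / 1e6 / secs }'
# 118143861 bytes in 21.7 s (5.4 MB/s)
# 118297616 bytes in 22.1 s (5.3 MB/s)
# 119653309 bytes in 26.8 s (4.5 MB/s)
```

So the reads were slow but still completing; I don't know if that is related to the shutdowns or just a symptom of the job load.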


Every time I restart the cluster it comes up fine, with all the nodes alive,
but a few seconds into a MapReduce job some nodes die with that error, and
the set of dead nodes is different every time.

Do you have any idea what is happening? I'm using Hadoop 2.4.1, and as I
said, the cluster had been running for months without problems before this.


I cannot find any error in the logs before it receives the SIGTERM.
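Since the SIGTERM has to come from outside the JVM, these are the OS-side checks I still plan to run on a node after it dies (the paths are stock-Linux defaults on my install, so they may differ on yours):

```shell
# Checks on a dead node to find out what sent the SIGTERM.
# All guarded so the script keeps going if a log file is missing.

# 1. Kernel ring buffer: was the kernel killing processes for memory?
dmesg 2>/dev/null | grep -iE "out of memory|oom|killed process" || true

# 2. System logs around the crash time (Feb  5 11:36 on computer75):
grep -h "Feb  5 11:3" /var/log/syslog /var/log/messages 2>/dev/null || true

# 3. Any cron job or script on this node that could be stopping the daemons:
crontab -l 2>/dev/null | grep -i hadoop || true
```

So far nothing has jumped out, but maybe someone recognizes the pattern.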


Moreover, I tried using Spark and it seems to work (I analyzed and saved
about 100 GB without problems), and fsck reports that HDFS is healthy.
Nevertheless, in a plain MapReduce job the map tasks start failing (not all
of them; some finish correctly).


Any ideas on how to solve this?



Thanks.
