I had a running Hadoop cluster (version 2.2.0.2.0.6.0-76 from Hortonworks). Yesterday a lot of things happened, and at some point we decided to reboot all datanodes one by one. Unfortunately, the operator did not monitor the namenode health monitor.
As a result of the above operation, all datanodes show as dead nodes, all blocks are lost, ... . We rebooted one datanode once more to see whether it would log anything interesting. Its log ends with:

    INFO ipc.Server (Server.java:run(861)) - IPC Server Responder: starting
    INFO ipc.Server (Server.java:run(688)) - IPC Server listener on 8010: starting

and hangs there. At the same time, on the namenode I can see only two types of messages:

    INFO hdfs.StateChange (FSNamesystem.java:completeFile(2805)) - DIR* completeFile: [SOME PATH] is closed by DFSClient_NONMAPREDUCE_288661168_33

and a lot of:

    WARN blockmanagement.BlockManager (PendingReplicationBlocks.java:pendingReplicationCheck(249)) - PendingReplicationMonitor timed out blk_1074405820_668233

Today we decided to restart the namenode and all datanodes. After the restart, the web page http://[server]:50070/dfshealth.jsp responds VERY slowly. I don't see any errors in the logs except five like the one below:

    ERROR datanode.DataNode (DataXceiver.java:run(225)) - maelhd21:50010:DataXceiver error processing WRITE_BLOCK operation src: /node1:33470 dest: /node3:50010
    org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-1037132819-192.168.61.196-1409328081083:blk_1075994366_2257020 already exists in state FINALIZED and thus cannot be created.

3 out of 5 nodes show as live, but refreshing the Hadoop status page takes more than 10 minutes.

The question, of course, is: what should I check or do now?

P.S. I asked the same question on StackOverflow: http://stackoverflow.com/questions/31020877/datanodes-are-cannot-connect-to-namenode
