I had another node go down last night. There was no load at the time; it just
seems it hit trouble and shut itself down. Any help would be greatly
appreciated. Why would the file system go away? Is this a Hadoop problem, a
hardware problem, or something else?

Here is a sampling of the logs. Please let me know what other information is
needed to help figure this out.

This is the first sign of trouble.

07:47:51,079 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream
ResponseProcessor exception  for block
blk_-5666628798929448999_724103java.net.SocketTimeoutException: 69000
millis timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel

07:47:51,080 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for
block blk_-5666628798929448999_724103 bad datanode[0]

It then fails 6 recovery attempts on each of 3 nodes, including itself, all
for the same block.

07:47:51,129 WARN org.apache.hadoop.hdfs.DFSClient: Failed recovery
attempt #0 from primary datanode x.x.x.x6:50010
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.ipc.RemoteException: java.io.IOException:
blk_-5666628798929448999_724103 is already commited, storedBlock ==
null.

Then it logs 30+ fatal errors regarding the WAL.

2011-01-06 07:48:42,075 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer:
java.io.IOException: Error Recovery for block
blk_-5666628798929448999_724103 failed  because recovery from primary
datanode x.x.x.x8:50010 failed 6 times.  Pipeline was x.x.x.x8:50010.
Aborting...
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2741)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:2172)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2371)
2011-01-06 07:48:42,077 FATAL
org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append.
Requesting close of hlog


Finally it gives up and we see the message below.

2011-01-06 07:48:42,233 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region
server serverName=x.x.x.x7,60020,1294257428578, load=(requests=76,
regions=320, usedHeap=3511, maxHeap=7987): File System not available
java.io.IOException: File system is not available


In case it helps, the Hadoop logs have the following showing up 6x.

07:48:30,936 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2
on 50020, call recoverBlock(blk_-5666628798929448999_724103, false,
[Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@3d40e2db) from
x.x.x.x7:42652: error: org.apache.hadoop.ipc.RemoteException:
java.io.IOException: blk_-5666628798929448999_724103 is already
commited, storedBlock == null.


The other nodes it was trying to talk to had no HBase log entries of
interest, but their Hadoop logs looked like the following, also 6x.

2011-01-06 07:48:37,013 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 2 on 50020, call recoverBlock(blk_-5666628798929448999_724103,
false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@58c2ebf) from
x.x.x.x7:42979: error: org.apache.hadoop.ipc.RemoteException:
java.io.IOException: blk_-5666628798929448999_724103 is already
commited, storedBlock == null.


I am not sure if it is related, but I see java.net.SocketTimeoutException
happening a lot on most of the nodes in the Hadoop logs (not at the same time
as the node going down). I see some recent list activity related to this, but
I am not sure it has anything to do with our problem. I also see errors like
the one below; they may be totally unrelated, but as I look closely at the
logs this is the first time I have noticed them.
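If it matters, the 69000 ms figure in the SocketTimeoutException looks consistent with the default HDFS client read timeout (dfs.socket.timeout, 60000 ms) plus a 3000 ms per-datanode extension for a 3-node pipeline; I believe we are on the defaults. A sketch of the relevant hdfs-site.xml entries, with values being my understanding of the defaults rather than something copied from our config:

```xml
<!-- hdfs-site.xml: client/datanode socket timeouts. Values shown are
     assumed defaults for illustration, not taken from our cluster. -->
<property>
  <name>dfs.socket.timeout</name>
  <!-- base read timeout in ms; 60000 + 3000 per pipeline node
       would give the 69000 ms seen in the log for a 3-node pipeline -->
  <value>60000</value>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <!-- write-side timeout in ms -->
  <value>480000</value>
</property>
```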

05:45:33,059 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(x.x.x.x8:50010,
storageID=DS-201495735-x.x.x.x8-50010-1293567644535, infoPort=50075,
ipcPort=50020):DataXceiver
java.io.IOException: Block blk_3051235367015871067_707839 is not valid.


Thanks in advance for any assistance anyone can provide.
