On Thu, Jan 6, 2011 at 5:46 AM, Wayne <[email protected]> wrote:
> I had another node go down last night. There was no load at the time; it
> just seems it had issues and shut itself down. Any help would be greatly
> appreciated. Why would the file system go away?

An incorrect ulimit or xceivers setting, or HDFS was shut down out from
under HBase.
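
For reference, the usual fixes look something like the below (the values
here are the commonly recommended starting points, not a prescription,
and the exact user/paths depend on your install):

  <!-- hdfs-site.xml on every datanode; note the property name
       really is misspelled upstream -->
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>

  # /etc/security/limits.conf, for whichever user runs the daemons
  hadoop  -  nofile  32768

Datanodes need a restart to pick up the xcievers change.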


> Here is a sampling of the logs. Please let me know what other information is
> needed to help figure this out.
>
> This is the first sign of trouble.
>
> 07:47:51,079 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream
> ResponseProcessor exception  for block
> blk_-5666628798929448999_724103java.net.SocketTimeoutException: 69000
> millis timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel
>
> 07:47:51,080 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for
> block blk_-5666628798929448999_724103 bad datanode[0]
>
> It fails 6 recovery attempts on each of 3 nodes, including itself, for the
> same block.
>

Check the datanode logs, in particular on the nodes participating in
this block's replication.  They may tell you something.  Grepping for
the block in the namenode log can also be revealing.
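
Something like

  grep blk_-5666628798929448999 /var/log/hadoop/*namenode*.log*

on the namenode (adjust the path to wherever your logs live), and the
same grep over the datanode logs on the nodes in the pipeline.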




> 2011-01-06 07:48:42,075 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> java.io.IOException: Error Recovery for block
> blk_-5666628798929448999_724103 failed  because recovery from primary
> datanode x.x.x.x8:50010 failed 6 times.  Pipeline was x.x.x.x8:50010.
> Aborting...
>        at 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2741)
>        at 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:2172)
>        at 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2371)
> 2011-01-06 07:48:42,077 FATAL
> org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append.
> Requesting close of hlog
>


That's as though you were running with one replica only, or the other
replicas had died and been marked bad by the dfsclient.

The regionserver will shut itself down if it can't write to the
filesystem, to protect against data loss.
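
You can double-check replication under the hbase root with fsck, e.g.

  hadoop fsck /hbase -files -blocks -locations

(assuming /hbase is your hbase rootdir); it will show which datanodes
hold each replica and flag anything under-replicated.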

>
> Finally it gives up and we see the message below.
>
> 2011-01-06 07:48:42,233 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region
> server serverName=x.x.x.x7,60020,1294257428578, load=(requests=76,
> regions=320, usedHeap=3511, maxHeap=7987): File System not available
> java.io.IOException: File system is not available
>

Yeah, as far as it's concerned, the FS is hosed.


What Hadoop version are you running on top of? (Sorry, you've probably
already said in earlier emails.)
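
If it's handy, the output of

  hadoop version

run on one of the nodes would tell us.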


> I am not sure if it is related but I see java.net.SocketTimeoutException
> happening a lot on most of the nodes in the hadoop log (not at the same time
> as the node going down). I see some recent activity related to this, but I
> am not sure if it has anything to do with our problems. I also see errors
> like the one below; they may be totally unrelated, but as I look closely at
> the logs I am seeing these things for the first time.
>
> 05:45:33,059 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(x.x.x.x8:50010,
> storageID=DS-201495735-x.x.x.x8-50010-1293567644535, infoPort=50075,
> ipcPort=50020):DataXceiver
> java.io.IOException: Block blk_3051235367015871067_707839 is not valid.
>

If you grep for this block in the namenode log, does it tell you anything?
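
E.g.

  grep blk_3051235367015871067 /var/log/hadoop/*namenode*.log*

(same caveat about the log path as above).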

Thanks Wayne,
St.Ack


>
> Thanks in advance for any assistance anyone can provide.
>
