You have a lot of memory. Do you have any metrics from the failing servers? Maybe they are "stuck" on a big garbage collection pause?
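For example, a quick sketch like the one below (plain JMX, nothing HBase-specific; the class name is just an illustration) could be run against, or adapted for, the region server JVMs to see whether the collectors are racking up long cumulative pause times that line up with the timeouts:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Dump cumulative GC counts and times for the local JVM.
// If collection time jumps by hundreds of seconds between runs,
// long GC pauses are a likely suspect for the socket timeouts.
public class GcCheck {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}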
Just my 2¢...

JM

2012/12/4, Arati Patro <[email protected]>:
> Hi,
>
> I'm using hbase version 0.94.1 and hadoop version 1.0.3
>
> I'm running HBase + HDFS on a 4 node cluster (48 GB RAM, 12TB DiskSpace on
> each node).
>
> 1 HMaster + NameNode and
> 3 HRegionServer + DataNode
>
> Replication is set to 2
>
> Running 6 MapReduce jobs (two of which run concurrently)
>
> 2012-12-03 01:54:13,444 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.63.63.249:50010,
> storageID=DS-1323881041-10.63.63.249-50010-1354010820987, infoPort=50075,
> ipcPort=50020):DataXceiver
> java.net.SocketTimeoutException: 480000 millis timeout while waiting for
> channel to be ready for write. ch :
> java.nio.channels.SocketChannel[connected local=/10.63.63.249:50010
> remote=/10.63.63.249:52264]
>         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
>         at java.lang.Thread.run(Thread.java:619)
>
> Any idea what could be causing this?
>
> Thanks,
>
> Arati Patro
