Thanks J-D. Well, doesn't the following message imply HDFS could accept writes when it has at least 1 data node available?

> error: java.io.IOException: File
> /hbase/Object_Speed_Test/1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621
> could only be replicated to 0 nodes, instead of 1

Also, it's strange that the region servers got corrupted reads when there were still two replicas available on HDFS.

I'll try to reproduce this when I get some spare time.
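
In case it helps, here's the kind of minimal probe I have in mind: a standalone write against the degraded cluster, to see whether block allocation fails the same way outside HBase. This is only a sketch against the Hadoop 0.20 API; the class name, the NameNode address, and the create() parameters are made up for illustration.

---------------------------------------------------------------------
// Rough write probe (all names hypothetical): create a file with
// replication 3 against the NN on Node 1 while Node 3 is down.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://node1:9000"); // hypothetical NN address

        FileSystem fs = FileSystem.get(conf);

        // overwrite=true, 4 KB buffer, replication 3, 64 MB block size
        // (illustrative values only)
        FSDataOutputStream out = fs.create(
                new Path("/tmp/write-probe"), true, 4096, (short) 3, 64L * 1024 * 1024);

        // The NN allocates the first block (addBlock) once the data is
        // streamed out; if allocation fails, it should surface as the
        // same "could only be replicated to 0 nodes" IOException.
        out.write("probe".getBytes());
        out.close();
        fs.close();
    }
}
---------------------------------------------------------------------

If this fails with the same IOException while two DataNodes are still up, then the problem is in HDFS itself and HBase is just the messenger.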

Thanks,

--
Tatsuya Kawano
Tokyo, Japan

On Mar 5, 2011, at 11:36 AM, Jean-Daniel Cryans <[email protected]> wrote:

> It looks like the NN isn't able to create new files:
>
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place
> enough replicas, still in need of 2
>
> Which makes sense, right? So HBase freaks out because it's not able to flush.
>
> BTW, testing on 3 nodes with dfs rep set to 3 is, to me, like testing
> just one node, and it's not worth it.
>
> J-D
>
> On Fri, Mar 4, 2011 at 5:38 PM, Tatsuya Kawano <[email protected]> wrote:
>> Hi,
>>
>> I got this question on the Hadoop User Group Japan mailing list, but I
>> need some help from the experts here. It looks like an HDFS issue,
>> maybe "append" related? But I'm not totally sure yet.
>>
>> The person who posted the original question is testing HA features in
>> HBase 0.90.0 and ASF Hadoop 0.20.2 (with
>> hadoop-core-0.20-append-r1056497.jar).
>>
>> His test cluster has only 3 nodes:
>>
>> Node 1: RS, DN, ZK plus HM, NN
>> Node 2: RS, DN, ZK
>> Node 3: RS, DN, ZK
>>
>> dfs.replication = 3
>>
>> He brought down Node 3 (which was handling Put requests from his test
>> client) with a kernel panic ("echo c > /proc/sysrq-trigger"). But the
>> Region Servers on Node 1 and Node 2 also went down, with the following
>> message:
>>
>> ---------------------------------------------------------------------
>> 2011-03-01 23:13:13,056 FATAL
>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
>> server serverName=ap12.secur2,60020,1298987576087, load=(requests=0,
>> regions=4, usedHeap=218, maxHeap=1998): Replay of HLog required.
>> Forcing server shutdown
>> org.apache.hadoop.hbase.DroppedSnapshotException: region:
>> Object_Speed_Test,
>> 5003017357526424133520110201051038918,1298988549775.1dbc1bf84b48e1145638b3a3bc3ad1cd
>> ---------------------------------------------------------------------
>>
>> He can easily reproduce this issue on his cluster.
>>
>> So, looking at the above message, I thought there was something wrong
>> with HDFS, and the RS was reading a corrupted HFile or something from
>> HDFS.
>>
>> Then we checked the HDFS NN and DN logs, and it seems the NN was
>> confused and wasn't able to allocate a block for the write.
>>
>> ---------------------------------------------------------------------
>> 2011-03-01 23:13:13,006 INFO
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
>> ugi=hbase,hadoop ip=/XX.XX.XX.XX cmd=create
>> src=/hbase/Object_Speed_Test/1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621
>> dst=null perm=hbase:supergroup:rw-r--r--
>> 2011-03-01 23:13:13,048 INFO org.apache.hadoop.ipc.Server: IPC Server
>> handler 2 on 9000, call
>> addBlock(/hbase/Object_Speed_Test/1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621,
>> DFSClient_hb_rs_ap12.secur2,60020,1298987576087_1298987617433, null)
>> from XX.XX.XX.XX:55462: error: java.io.IOException: File
>> /hbase/Object_Speed_Test/1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621
>> could only be replicated to 0 nodes, instead of 1
>> java.io.IOException: File
>> /hbase/Object_Speed_Test/1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621
>> could only be replicated to 0 nodes, instead of 1
>> ---------------------------------------------------------------------
>>
>> It seems the kernel panic on Node 3 put HDFS in a wrong state, so the
>> Region Servers couldn't write to or read from HDFS and had to shut
>> themselves down.
>>
>> We couldn't find any more clues in the logs, but I pasted them here:
>>
>> http://pastebin.com/NYkNS1c1
>>
>> Since dfs.replication = 3, all Data Nodes were participating in the
>> HLog write at the time Node 3 got the kernel panic. I think this
>> somehow made the NameNode think those Data Nodes were all gone, but I
>> couldn't find the root cause of this issue.
>>
>> Also, he checked the network and disk space, and he believes there
>> was no issue with them when he was testing.
>>
>> Thanks,
>>
>> --
>> Tatsuya Kawano
>> Tokyo, Japan
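
One more thought on how to dig into this: when it happens again, it would be worth checking how many DataNodes the NameNode still considers live at that moment. Here's a rough sketch against the Hadoop 0.20 API (the NameNode address is again hypothetical; getDataNodeStats() is on DistributedFileSystem):

---------------------------------------------------------------------
// Print the NN's current view of each DataNode.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DatanodeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://node1:9000"); // hypothetical NN address

        DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(conf);

        // One report per DataNode the NN knows about; the "Last contact"
        // line shows how stale each node's heartbeat is.
        for (DatanodeInfo dn : dfs.getDataNodeStats()) {
            System.out.println(dn.getDatanodeReport());
        }
        dfs.close();
    }
}
---------------------------------------------------------------------

("hadoop dfsadmin -report" prints the same information from the command line.) Note that a crashed DataNode stays in the NameNode's live list until the heartbeat timeout expires (a bit over ten minutes by default), so the NameNode's view right after the kernel panic may not match reality.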
