It looks like the NN isn't able to create new files:

org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place enough replicas, still in need of 2
Which makes sense, right? So HBase freaks out because it's not able to flush.

BTW, testing on 3 nodes with dfs.replication set to 3 is, to me, like testing just one node, and it's not worth it.

J-D

On Fri, Mar 4, 2011 at 5:38 PM, Tatsuya Kawano <[email protected]> wrote:
> Hi,
>
> I got this question on the Hadoop User Group Japan mailing list, but I
> need some help from the experts here. It looks like an HDFS issue, maybe
> "append" related, but I'm not totally sure yet.
>
> The person who posted the original question is testing HA features in
> HBase 0.90.0 and ASF Hadoop 0.20.2 (with
> hadoop-core-0.20-append-r1056497.jar).
>
> His test cluster has only 3 nodes:
>
> Node 1: RS, DN, ZK plus HM, NN
> Node 2: RS, DN, ZK
> Node 3: RS, DN, ZK
>
> dfs.replication = 3
>
> He brought down Node 3 (which was handling Put requests from his test
> client) with a kernel panic ("echo c > /proc/sysrq-trigger"). But the
> Region Servers on Node 1 and Node 2 also went down with the following
> message:
>
> ---------------------------------------------------------------------
> 2011-03-01 23:13:13,056 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
> server serverName=ap12.secur2,60020,1298987576087, load=(requests=0,
> regions=4, usedHeap=218, maxHeap=1998): Replay of HLog required.
> Forcing server shutdown
> org.apache.hadoop.hbase.DroppedSnapshotException: region:
> Object_Speed_Test,
> 5003017357526424133520110201051038918,1298988549775.1dbc1bf84b48e1145638b3a3bc3ad1cd
> ---------------------------------------------------------------------
>
> He can easily reproduce this issue on his cluster.
>
> Looking at the above message, I thought there was something wrong
> with HDFS, and the RS was reading a corrupted HFile or something
> from HDFS.
>
> Then we checked the HDFS NN and DN logs, and it seems the NN was
> confused and wasn't able to allocate a block for the write.
> ---------------------------------------------------------------------
> 2011-03-01 23:13:13,006 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
> ugi=hbase,hadoop ip=/XX.XX.XX.XX cmd=create src=/hbase/
> Object_Speed_Test/1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/
> 1275904589980700621 dst=null perm=hbase:supergroup:rw-r--r--
> 2011-03-01 23:13:13,048 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 2 on 9000, call addBlock(/hbase/Object_Speed_Test/
> 1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621,
> DFSClient_hb_rs_ap12.secur2,60020,1298987576087_1298987617433, null)
> from XX.XX.XX.XX:55462: error: java.io.IOException: File /hbase/
> Object_Speed_Test/1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/
> 1275904589980700621 could only be replicated to 0 nodes, instead of 1
> java.io.IOException: File /hbase/Object_Speed_Test/
> 1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621 could only
> be replicated to 0 nodes, instead of 1
> ---------------------------------------------------------------------
>
> It seems the kernel panic on Node 3 put HDFS into a bad state, so the
> Region Servers couldn't write to or read from HDFS and had to shut
> themselves down.
>
> We couldn't find any more clues in the logs, but I pasted them here:
>
> http://pastebin.com/NYkNS1c1
>
> Since dfs.replication = 3, all Data Nodes were participating in the
> HLog write at the time Node 3 got the kernel panic. I think this
> somehow made the Name Node think those Data Nodes were all gone, but
> I couldn't find the root cause of this issue.
>
> Also, he checked the network and disk space, and he believes there
> was no issue with them while he was testing.
>
> Thanks,
>
> --
> Tatsuya Kawano
> Tokyo, Japan
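The "could only be replicated to 0 nodes, instead of 1" failure and J-D's point about 3 nodes with replication 3 can be sketched numerically. This is a toy model, not HDFS source code: the node names and the minimum-replication value of 1 are assumptions for illustration. The idea is that with dfs.replication equal to the cluster size, every live DataNode is in every write pipeline, so losing (or distrusting) nodes leaves the NameNode with no healthy targets for new blocks.

```python
# Toy model (not HDFS source) of the NameNode choosing DataNode targets
# for a new block. Node names and min_replication=1 are assumptions.

def choose_targets(live_datanodes, replication):
    """Pick up to `replication` distinct DataNodes for a new block."""
    return live_datanodes[:replication]  # one replica per distinct node

def can_allocate_block(live_datanodes, replication, min_replication=1):
    # The write can proceed if at least min_replication targets exist;
    # with zero healthy targets even that fails, which matches
    # "could only be replicated to 0 nodes, instead of 1" in the NN log.
    return len(choose_targets(live_datanodes, replication)) >= min_replication

all_nodes = ["node1", "node2", "node3"]
print(can_allocate_block(all_nodes, 3))   # True: 3 healthy targets

# Node 3 kernel-panics; writes still proceed (under-replicated):
print(can_allocate_block(["node1", "node2"], 3))  # True: 2 of 1 needed

# But if the NN also stops trusting node1/node2 (stale heartbeats,
# failed pipelines), no targets remain and allocation fails outright:
print(can_allocate_block([], 3))          # False: 0 of 1 needed
```

With replication equal to the node count, there is no spare node outside the pipeline, so a single failure immediately puts every new block below its target replication, and any further suspicion of the surviving nodes blocks writes entirely.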
