It looks like the NN isn't able to create new files:

org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place
enough replicas, still in need of 2

Which makes sense, right? So HBase freaks out because it's not able to flush.

BTW, testing on 3 nodes with dfs.replication set to 3 is, to me, like
testing just one node, and it's not worth it.
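To illustrate that point with a toy model (this is not the NameNode's actual BlockPlacementPolicy code; the function and variable names below are made up): with dfs.replication equal to the number of DataNodes, losing one node already leaves the NN short of placement targets, and if the surviving DNs also become ineligible for a block (e.g. after pipeline failures), a create gets zero targets, which matches the errors in the logs below.

```python
# Toy model of NameNode replica placement -- illustration only, not
# the real HDFS code. Names here are hypothetical.

def choose_targets(live_datanodes, excluded, replication):
    """Pick up to `replication` distinct targets from DataNodes that
    are alive and not excluded for this block."""
    eligible = [dn for dn in live_datanodes if dn not in excluded]
    return eligible[:replication]

MIN_REPLICATION = 1  # dfs.replication.min defaults to 1

# Healthy 3-node cluster, dfs.replication = 3: all three DNs chosen.
print(choose_targets(["node1", "node2", "node3"], set(), 3))

# Node 3 panics: only 2 of 3 replicas can be placed, so the NN logs
# that it is "Not able to place enough replicas" for the rest...
targets = choose_targets(["node1", "node2"], set(), 3)
print(len(targets))

# ...and if the surviving DNs are also excluded for the block, zero
# targets remain -- the "could only be replicated to 0 nodes, instead
# of 1" IOException the client sees.
targets = choose_targets(["node1", "node2"], {"node1", "node2"}, 3)
if len(targets) < MIN_REPLICATION:
    print("could only be replicated to %d nodes, instead of %d"
          % (len(targets), MIN_REPLICATION))
```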

J-D

On Fri, Mar 4, 2011 at 5:38 PM, Tatsuya Kawano <[email protected]> wrote:
> Hi,
>
> I got this question on the Hadoop User Group Japan mailing list, but
> I need some help from the experts here. It looks like an HDFS issue,
> maybe "append" related, but I'm not totally sure yet.
>
> The person who posted the original question is testing HA features in
> HBase 0.90.0 and ASF Hadoop 0.20.2 (with
> hadoop-core-0.20-append-r1056497.jar)
>
> His test cluster has only 3 nodes.
>
> Node 1: RS, DN, ZK   plus   HM, NN
> Node 2: RS, DN, ZK
> Node 3: RS, DN, ZK
>
> dfs.replication = 3
>
>
> He brought down Node 3 (which was handling Put requests from his test
> client) with a kernel panic ("echo c > /proc/sysrq-trigger"). But the
> Region Servers on Node 1 and Node 2 also went down with the following
> message.
>
> ---------------------------------------------------------------------
> 2011-03-01 23:13:13,056 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
> server serverName=ap12.secur2,60020,1298987576087, load=(requests=0,
> regions=4, usedHeap=218, maxHeap=1998): Replay of HLog required.
> Forcing server shutdown
> org.apache.hadoop.hbase.DroppedSnapshotException: region:
> Object_Speed_Test,
> 5003017357526424133520110201051038918,1298988549775.1dbc1bf84b48e1145638b3a3bc3ad1cd
> ---------------------------------------------------------------------
>
> He can easily reproduce this issue on his cluster.
>
> So, looking at the above message, I thought there was something wrong
> with HDFS, and the RS was reading a corrupted HFile or something from
> HDFS.
>
> Then, we checked the HDFS NN and DN logs, and it seems the NN was
> confused and wasn't able to allocate a block for the write.
>
> ---------------------------------------------------------------------
> 2011-03-01 23:13:13,006 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
> ugi=hbase,hadoop        ip=/XX.XX.XX.XX   cmd=create      src=/hbase/
> Object_Speed_Test/1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/
> 1275904589980700621    dst=null        perm=hbase:supergroup:rw-r--r--
> 2011-03-01 23:13:13,048 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 2 on 9000, call addBlock(/hbase/Object_Speed_Test/
> 1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621,
> DFSClient_hb_rs_ap12.secur2,60020,1298987576087_1298987617433, null)
> from XX.XX.XX.XX:55462: error: java.io.IOException: File /hbase/
> Object_Speed_Test/1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/
> 1275904589980700621 could only be replicated to 0 nodes, instead of 1
> java.io.IOException: File /hbase/Object_Speed_Test/
> 1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621 could only
> be replicated to 0 nodes, instead of 1
> ---------------------------------------------------------------------
>
> It seems the kernel panic on Node 3 put HDFS in a bad state, so the
> Region Servers couldn't write to or read from HDFS and had to shut
> themselves down.
>
> We couldn't find any more clues in the logs, but I pasted them here:
>
> http://pastebin.com/NYkNS1c1
>
>
> Since dfs.replication = 3, all Data Nodes were participating in the
> HLog write at the time Node 3 got the kernel panic. I think this
> somehow made the Name Node think those Data Nodes were all gone. But
> I couldn't find the root cause of this issue.
>
> Also, he checked the network and disk space, and he believes there
> was no issue with them when he was testing.
>
> Thanks,
>
> --
> Tatsuya Kawano
> Tokyo, Japan
>
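Putting the two quoted log excerpts together: the NN's addBlock failure surfaces in the RS as a failed memstore flush, and a failed flush is fatal by design, which is what the FATAL "Replay of HLog required" line shows. A minimal sketch of that control flow (hypothetical function names, not actual HBase source):

```python
# Toy sketch -- not real HBase code -- of why a failed flush aborts a
# RegionServer: if the snapshot can't be persisted to HDFS, the only
# safe recovery is to shut down and let the HLog be replayed.

class DroppedSnapshotException(Exception):
    pass

def flush_region(hdfs_accepting_writes):
    # A flush writes a new HFile under the region's .tmp directory; if
    # the NN can't allocate a block for it, the snapshot is dropped.
    if not hdfs_accepting_writes:
        raise DroppedSnapshotException("region: Object_Speed_Test")
    return "flushed"

def handle_flush(hdfs_accepting_writes):
    try:
        return flush_region(hdfs_accepting_writes)
    except DroppedSnapshotException:
        # Mirrors the FATAL log line: abort rather than risk losing
        # edits that exist only in the memstore.
        return "ABORTING region server: Replay of HLog required"

print(handle_flush(True))
print(handle_flush(False))
```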
