Thanks, J-D.

Well, doesn't the following message imply HDFS could accept writes when it 
has at least 1 data node available? 

> error: java.io.IOException: File 
> /hbase/Object_Speed_Test/1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621
>  could only be replicated to 0 nodes, instead of 1

Also, it's strange that the region servers got corrupted reads when there 
were two more replicas available on HDFS. 

I'll try to reproduce this when I get some spare time.
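(As an aside for anyone else reproducing this: following J-D's point below about 3 nodes with replication 3, keeping dfs.replication below the node count should let the NameNode still place all replicas after one DataNode dies. A hypothetical hdfs-site.xml fragment, just as a sketch; the property name is from stock Hadoop 0.20:)

```xml
<!-- hdfs-site.xml (sketch): with only 3 DataNodes, a replication
     factor of 2 leaves the NameNode one node of headroom, so losing
     a single DataNode does not block new block allocations. -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```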

Thanks, 

--
Tatsuya Kawano
Tokyo, Japan


On Mar 5, 2011, at 11:36 AM, Jean-Daniel Cryans <[email protected]> wrote:

> It looks like the NN isn't able to create new files:
> 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place
> enough replicas, still in need of 2
> 
> Which makes sense, right? So HBase freaks out because it's not able to flush.
> 
> BTW testing on 3 nodes with dfs rep set to 3 to me is like testing
> just one node, and it's not worth it.
> 
> J-D
> 
> On Fri, Mar 4, 2011 at 5:38 PM, Tatsuya Kawano <[email protected]> wrote:
>> Hi,
>> 
>> I got this question at the Hadoop User Group Japan mailing list, but I
>> need some help from the experts here. It looks like an HDFS issue,
>> maybe "append" related, but I'm not totally sure yet.
>> 
>> The person who posted the original question is testing HA features in
>> HBase 0.90.0 and ASF Hadoop 0.20.2 (with
>> hadoop-core-0.20-append-r1056497.jar).
>> 
>> His test cluster has only 3 nodes.
>> 
>> Node 1: RS, DN, ZK   plus   HM, NN
>> Node 2: RS, DN, ZK
>> Node 3: RS, DN, ZK
>> 
>> dfs.replication = 3
>> 
>> 
>> He brought down Node 3 (which was handling Put requests from his test
>> client) with a kernel panic ("echo c > /proc/sysrq-trigger"), but the
>> Region Servers on Node 1 and Node 2 also went down with the following
>> message.
>> 
>> ---------------------------------------------------------------------
>> 2011-03-01 23:13:13,056 FATAL
>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
>> server serverName=ap12.secur2,60020,1298987576087, load=(requests=0,
>> regions=4, usedHeap=218, maxHeap=1998): Replay of HLog required.
>> Forcing server shutdown
>> org.apache.hadoop.hbase.DroppedSnapshotException: region:
>> Object_Speed_Test,
>> 5003017357526424133520110201051038918,1298988549775.1dbc1bf84b48e1145638b3a3bc3ad1cd
>> ---------------------------------------------------------------------
>> 
>> He can easily reproduce this issue on his cluster.
>> 
>> So, looking at the above message, I thought there was something
>> wrong with HDFS, and the RS was reading a corrupted HFile or something
>> from HDFS.
>> 
>> Then we checked the HDFS NN and DN logs, and it seems the NN was
>> confused and wasn't able to allocate a block for the write.
>> 
>> ---------------------------------------------------------------------
>> 2011-03-01 23:13:13,006 INFO
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
>> ugi=hbase,hadoop        ip=/XX.XX.XX.XX   cmd=create      src=/hbase/
>> Object_Speed_Test/1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/
>> 1275904589980700621    dst=null        perm=hbase:supergroup:rw-r--r--
>> 2011-03-01 23:13:13,048 INFO org.apache.hadoop.ipc.Server: IPC Server
>> handler 2 on 9000, call addBlock(/hbase/Object_Speed_Test/
>> 1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621,
>> DFSClient_hb_rs_ap12.secur2,60020,1298987576087_1298987617433, null)
>> from XX.XX.XX.XX:55462: error: java.io.IOException: File /hbase/
>> Object_Speed_Test/1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/
>> 1275904589980700621 could only be replicated to 0 nodes, instead of 1
>> java.io.IOException: File /hbase/Object_Speed_Test/
>> 1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621 could only
>> be replicated to 0 nodes, instead of 1
>> ---------------------------------------------------------------------
>> 
>> It seems the kernel panic on Node 3 put HDFS into a bad state, so the
>> Region Servers couldn't write to or read from HDFS and had to shut
>> themselves down.
>> 
>> We couldn't find any more clues in the logs, but I pasted them here:
>> 
>> http://pastebin.com/NYkNS1c1
>> 
>> 
>> Since dfs.replication = 3, all Data Nodes were participating in the
>> HLog write at the time Node 3 got the kernel panic. I think this
>> somehow made the Name Node think those Data Nodes were all gone, but I
>> couldn't find the root cause of this issue.
>> 
>> Also, he checked the network and disk space, and he believes there
>> were no issues with them while he was testing.
>> 
>> Thanks,
>> 
>> --
>> Tatsuya Kawano
>> Tokyo, Japan
>> 
