Hi, 

I got this question on the Hadoop User Group Japan mailing list, and I need 
some help from the experts here. It looks like an HDFS issue, maybe "append" 
related, but I'm not totally sure yet. 

The person who posted the original question is testing HA features with HBase 
0.90.0 and ASF Hadoop 0.20.2 (with hadoop-core-0.20-append-r1056497.jar).

His test cluster has only 3 nodes. 

Node 1: RS, DN, ZK   plus   HM, NN
Node 2: RS, DN, ZK
Node 3: RS, DN, ZK

dfs.replication = 3


He brought down Node 3 (which was handling Put requests from his test client) 
by triggering a kernel panic ("echo c > /proc/sysrq-trigger"). However, the 
Region Servers on Node 1 and Node 2 also went down, with the following message. 

---------------------------------------------------------------------
2011-03-01 23:13:13,056 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
server serverName=ap12.secur2,60020,1298987576087, load=(requests=0,
regions=4, usedHeap=218, maxHeap=1998): Replay of HLog required.
Forcing server shutdown
org.apache.hadoop.hbase.DroppedSnapshotException: region:
Object_Speed_Test,
5003017357526424133520110201051038918,1298988549775.1dbc1bf84b48e1145638b3a3bc3ad1cd
---------------------------------------------------------------------

He can easily reproduce this issue on his cluster. 

Looking at the above message, I thought there was something wrong with HDFS, 
and that the RS was reading a corrupted HFile or something similar from HDFS. 

Then we checked the HDFS NN and DN logs, and it seems the NN was confused and 
wasn't able to allocate a block for the write. 

---------------------------------------------------------------------
2011-03-01 23:13:13,006 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
ugi=hbase,hadoop        ip=/XX.XX.XX.XX   cmd=create      src=/hbase/
Object_Speed_Test/1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/
1275904589980700621    dst=null        perm=hbase:supergroup:rw-r--r--
2011-03-01 23:13:13,048 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 2 on 9000, call addBlock(/hbase/Object_Speed_Test/
1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621,
DFSClient_hb_rs_ap12.secur2,60020,1298987576087_1298987617433, null)
from XX.XX.XX.XX:55462: error: java.io.IOException: File /hbase/
Object_Speed_Test/1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/
1275904589980700621 could only be replicated to 0 nodes, instead of 1
java.io.IOException: File /hbase/Object_Speed_Test/
1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621 could only
be replicated to 0 nodes, instead of 1
---------------------------------------------------------------------
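
For anyone who wants to check their own NameNode log for the same symptom, 
here is a minimal sketch in Python. It is only my illustration, not something 
we ran on his cluster. It assumes the error text has the same wording as in 
the excerpt above; the function name find_replication_failures is 
hypothetical. 

```python
import re

# Matches the NameNode error wording shown in the log excerpt above.
# Whitespace is matched loosely because the excerpt is line-wrapped.
# Assumption: the path starts with "/" and contains no ":" character.
PATTERN = re.compile(
    r"File\s+(?P<path>/[^:]*?)\s+could\s+only\s+be\s+replicated\s+to\s+"
    r"(?P<got>\d+)\s+nodes,\s+instead\s+of\s+(?P<want>\d+)"
)


def find_replication_failures(log_text):
    """Return a list of (path, nodes_got, nodes_wanted) tuples."""
    failures = []
    for match in PATTERN.finditer(log_text):
        # Remove whitespace that line wrapping introduced inside the path.
        path = re.sub(r"\s+", "", match.group("path"))
        failures.append((path, int(match.group("got")), int(match.group("want"))))
    return failures
```

Passing the contents of the NN log to find_replication_failures() should give 
one tuple per occurrence, which makes it easier to see which files were 
affected and when the failures started. 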

It seems the kernel panic on Node 3 put HDFS into a bad state, so the Region 
Servers couldn't write to or read from HDFS and had to shut themselves down. 

We couldn't find any more clues in the logs, but I pasted them here: 

http://pastebin.com/NYkNS1c1


Since dfs.replication = 3, all Data Nodes were participating in the HLog write 
at the time Node 3 got the kernel panic. I think this somehow made the Name 
Node think those Data Nodes were all gone. But I couldn't find the root cause 
of this issue. 

Also, he checked the network and disk space, and he believes there was no 
issue with either while he was testing. 

Thanks, 

--
Tatsuya Kawano
Tokyo, Japan
