Hi users,

We are running the following setup.
HBase version: 0.90.3 with HBASE-3777, HBASE-2937 and HBASE-3855 on top
Hadoop version: CDH3B3

I am trying to figure out the right way to restart the cluster when we want to push a patched jar or a configuration tweak. I have tried http://wiki.apache.org/hadoop/Hbase/RollingRestart among other things (such as reversing the order of the process restarts described there), but we always end up with the following error. We cannot use the rolling restart script because ssh is not configured correctly for the user running HBase. I have tried emulating the steps in the script by hand; it didn't help.

The problem is fairly easy to reproduce on the production cluster, but I have not been able to reproduce it on the staging instance. Staging holds much less data, so recovery there may simply take very little time; that is just a guess. There was nothing interesting in the logs of the RS running at bond0.ine-46.dummy.net. The way we get past this situation is by killing the RS causing the contention.

How are others handling cluster restarts?

2011-06-28 21:25:41,439 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=bond0.ine-52.dummy.net,60020,1309310584702, regionCount=0, userLoad=true
2011-06-28 21:25:41,445 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=bond0.ine-45.dummy.net,60020,1309310343488, regionCount=0, userLoad=true
2011-06-28 21:25:41,471 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=bond0.ine-47.dummy.net,60020,1309309545442, regionCount=0, userLoad=true
2011-06-28 21:25:42,210 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting on regionserver(s) count to settle; currently=4
2011-06-28 21:25:43,712 INFO org.apache.hadoop.hbase.master.ServerManager: Finished waiting for regionserver count to settle; count=4, sleptFor=4500
2011-06-28 21:25:43,712 INFO org.apache.hadoop.hbase.master.ServerManager: Exiting wait on regionserver(s) to checkin; count=4, stopped=false, count of regions out on cluster=0
2011-06-28 21:25:43,718 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://master-hadoop.ine-arp.dummy.net:8020/hbase/ine-arp/.logs/bond0.ine-45.dummy.net,60020,1309310343488 belongs to an existing region server
2011-06-28 21:25:43,719 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://master-hadoop.ine-arp.dummy.net:8020/hbase/ine-arp/.logs/bond0.ine-46.dummy.net,60020,1309310350732 doesn't belong to a known region server, splitting
2011-06-28 21:25:43,730 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 1 hlog(s) in hdfs://master-hadoop.ine-arp.dummy.net:8020/hbase/ine-arp/.logs/bond0.ine-46.dummy.net,60020,1309310350732
2011-06-28 21:25:43,735 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering file hdfs://master-hadoop.ine-arp.dummy.net:8020/hbase/ine-arp/.logs/bond0.ine-46.dummy.net,60020,1309310350732/bond0.ine-46.dummy.net%3A60020.1309310351513
2011-06-28 21:26:43,976 WARN org.apache.hadoop.hbase.util.FSUtils: Waited 60241ms for lease recovery on hdfs://master-hadoop.ine-arp.dummy.net:8020/hbase/ine-arp/.logs/bond0.ine-46.dummy.net,60020,1309310350732/bond0.ine-46.dummy.net%3A60020.1309310351513:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/ine-arp/.logs/bond0.ine-46.dummy.net,60020,1309310350732/bond0.ine-46.dummy.net%3A60020.1309310351513 for DFSClient_hb_m_bond0.ine-54.dummy.net:60000_1309310737997 on client 172.22.2.54, because this file is already being created by DFSClient_hb_rs_bond0.ine-46.dummy.net,60020,1309310350732_1309310351430 on 172.22.2.46
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1194)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1282)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:541)
    at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:528)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1319)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1315)
    at java.security.AccessController.doPrivileged(Native Method)
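
For anyone hitting the same thing, the RS still holding the HLog lease can be read straight out of the master log above. Below is a minimal sketch in Python that does this, assuming the log wording is exactly as in the excerpt. The function name `find_lease_holders` and both regexes are my own illustration, not part of HBase or Hadoop, and the pattern may need adjusting for other versions.

```python
import re

# Hypothetical helper, not part of HBase. It assumes the master log wording
# shown in the excerpt above and may not match other HBase/Hadoop versions.

# Matches the FSUtils warning and captures how long the master has waited.
WAIT_RE = re.compile(r"Waited (\d+)ms for lease recovery on")

# Matches the lease holder named in the AlreadyBeingCreatedException message,
# e.g. DFSClient_hb_rs_<host>,<port>,<startcode>_<id>.
HOLDER_RE = re.compile(
    r"already being created by DFSClient_hb_rs_([^,\s]+),(\d+),(\d+)_\d+"
)


def find_lease_holders(log_text):
    """Return (waited_ms, host, port) for each lease-recovery warning found."""
    results = []
    for wait in WAIT_RE.finditer(log_text):
        # Look for the holder named after this particular warning.
        holder = HOLDER_RE.search(log_text, wait.end())
        if holder is None:
            continue
        results.append(
            (int(wait.group(1)), holder.group(1), int(holder.group(2)))
        )
    return results
```

Run against the excerpt above, this would report bond0.ine-46.dummy.net on port 60020 as the holder, which is the RS we end up killing by hand.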
