Thank you, Stack!

On Wed, Feb 23, 2011 at 6:25 AM, Stack <[email protected]> wrote:
> On Mon, Feb 21, 2011 at 10:04 PM, Yi Liang <[email protected]> wrote:
> > Yes, the server zcl crashed at that time.
> >
> > But after I restarted it later, it's still in the dead server list.
> >
>
> We failed processing its death:
>
> 2011-02-18 10:08:14,873 ERROR org.apache.hadoop.hbase.HServerAddress: Could not resolve the DNS name of zcl.local:60020
> 2011-02-18 10:08:14,874 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
> java.lang.IllegalArgumentException: Could not resolve the DNS name of zcl.local:60020
>         at org.apache.hadoop.hbase.HServerAddress.checkBindAddressCanBeResolved(HServerAddress.java:105)
>         at org.apache.hadoop.hbase.HServerAddress.<init>(HServerAddress.java:66)
>         at org.apache.hadoop.hbase.catalog.MetaReader.metaRowToRegionPairWithInfo(MetaReader.java:407)
>         at org.apache.hadoop.hbase.catalog.MetaReader.getServerUserRegions(MetaReader.java:594)
>         at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:124)
>         at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
>
> It looks like the above exception caused us to jump out of the
> processing of the server shutdown. Above is related to the no route
> to host.
>
> I filed HBASE-3556. It'll be 'fixed' by HBASE-1501 but we should
> never just give up processing. Need to look into that.
>
> While a server is in the dead servers list, we'll not run the
> balancer. The dead servers list is an in-memory list. You'd need to
> kill the master and bring it back up again to rid the dead server
> state.
>
> St.Ack
>
> >
> > 2011-02-18 10:39:26,895 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=zcl.local,60020,1297996817352, regionCount=0, userLoad=false
> > 2011-02-18 10:39:35,062 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [Docete.local,60020,1297919410096, liym.local,60020,1297919445796, zcl.local,60020,1297919367472]
> >
> > On Tue, Feb 22, 2011 at 1:48 AM, Ted Yu <[email protected]> wrote:
> >
> >> Looks like there was a connectivity issue:
> >>
> >> java.net.NoRouteToHostException: No route to host
> >>
> >> On Sun, Feb 20, 2011 at 10:09 PM, Yi Liang <[email protected]> wrote:
> >>
> >> > The related log is at: http://pastebin.com/0a1CjDUD
> >> >
> >> > It's ok now after restarting hbase, but I'm still curious why it happened.
> >> >
> >> > Thanks,
> >> > Yi
> >> >
> >> > On Sat, Feb 19, 2011 at 3:58 AM, Jean-Daniel Cryans <[email protected]> wrote:
> >> >
> >> > > The master should finish processing those dead servers at some point
> >> > > and it seems it's not happening? Unfortunately without the log nobody
> >> > > can tell why. If you can post the complete log in pastebin or put it
> >> > > on a web server then we could take a look.
> >> > >
> >> > > J-D
> >> > >
> >> > > On Fri, Feb 18, 2011 at 12:39 AM, Yi Liang <[email protected]> wrote:
> >> > > > Hi all,
> >> > > >
> >> > > > We have an HBase cluster with 10 region servers running HBase 0.90.0 + CDH3.
> >> > > > We're now importing big data into HBase.
> >> > > >
> >> > > > During the process, 2 servers crashed, but after restarting them, they're
> >> > > > no longer assigned any regions, while regions on other servers keep
> >> > > > splitting as more data is inserted.
> >> > > >
> >> > > > From the master log, we can see periodic messages like:
> >> > > >
> >> > > > 2011-02-18 16:09:35,067 DEBUG org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): [zcl.local,60020,1297996817352, qics.local,60020,1297919358488, Docete.local,60020,1297919410096, liym.local,60020,1297919445796, zcl.local,60020,1297919367472]
> >> > > >
> >> > > > zcl.local and qics.local are the machines we have restarted; the other 2
> >> > > > machines have kept running without restarting and are actually still
> >> > > > serving regions.
> >> > > >
> >> > > > From the shell status:
> >> > > > 10 servers, 5 dead, 10.1000 average Load
> >> > > >
> >> > > > Why are there dead servers? And how can we clear them so we can start
> >> > > > the balancer?
> >> > > >
> >> > > > Thanks,
> >> > > > Yi
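
For anyone else reading a master log like the one quoted above: each entry in the dead regionserver list has the form host,port,startcode, so the same host can appear more than once if the region server was restarted (zcl.local shows up twice with different start codes). Below is a rough Python sketch that pulls the entries out of such a log line and groups the start codes by host. The function names `parse_dead_servers` and `group_by_host` are made up for illustration and are not part of HBase, and the regular expression assumes the message format is exactly as it appears in the quoted log.

```python
import re
from collections import defaultdict

# Log line copied from the quoted master log, with the wrapped pieces rejoined.
LOG_LINE = (
    "2011-02-18 16:09:35,067 DEBUG org.apache.hadoop.hbase.master.HMaster: "
    "Not running balancer because processing dead regionserver(s): "
    "[zcl.local,60020,1297996817352, qics.local,60020,1297919358488, "
    "Docete.local,60020,1297919410096, liym.local,60020,1297919445796, "
    "zcl.local,60020,1297919367472]"
)


def parse_dead_servers(line):
    """Return (host, port, startcode) tuples from a 'dead regionserver(s)' line.

    Hypothetical helper: returns an empty list if the line does not match.
    """
    match = re.search(r"dead regionserver\(s\): \[(.*)\]", line)
    if match is None:
        return []
    servers = []
    # Entries are separated by ", "; fields inside an entry by "," alone.
    for entry in match.group(1).split(", "):
        host, port, startcode = entry.split(",")
        servers.append((host, int(port), int(startcode)))
    return servers


def group_by_host(servers):
    """Map each host to the sorted list of start codes seen for it."""
    grouped = defaultdict(list)
    for host, _port, startcode in servers:
        grouped[host].append(startcode)
    return {host: sorted(codes) for host, codes in grouped.items()}


if __name__ == "__main__":
    for host, codes in group_by_host(parse_dead_servers(LOG_LINE)).items():
        print(host, codes)
```

A host with more than one start code in the output is one where both the old and the restarted region server instance are on the dead list, which matches what we saw for zcl.local.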
