On Mon, May 23, 2011 at 9:27 AM, Daniel Iancu <[email protected]> wrote:
> Hello everybody
> I've run into this strange problem. We run a 6 RS cluster and suddenly the
> client application started reporting errors, region not online. In the web
> console all regionserver appeared up.
What happened at this time? Check the master log at this timestamp;
it should give you a clue.
> I've run hbck and got strange results
...
> 12 dead servers
> search-hadoop-eu006.v300.gmx.net,60020,1305025929461
> search-hadoop-eu002.v300.gmx.net,60020,1305019508570
> search-hadoop-eu004.v300.gmx.net,60020,1305019551236
> search-hadoop-eu003.v300.gmx.net,60020,1305025688666
> search-hadoop-eu005.v300.gmx.net,60020,1305025841017
> search-hadoop-eu006.v300.gmx.net,60020,1306156842070
> search-hadoop-eu005.v300.gmx.net,60020,1305019568146
> search-hadoop-eu001.v300.gmx.net,60020,1305025543786
> search-hadoop-eu004.v300.gmx.net,60020,1305025761173
> search-hadoop-eu002.v300.gmx.net,60020,1305025611163
> search-hadoop-eu006.v300.gmx.net,60020,1305019572576
> search-hadoop-eu003.v300.gmx.net,60020,1305019547053
>
>
We used to hang on to the list of dead servers, even after a new
instance had checked in from the same host. In 0.90.2 we fixed
this ("HBASE-3580 Remove RS from DeadServer when new instance checks
in"). I'm not sure this change made it into the released cdh3 (you
might check the cdh CHANGES).
So, do the online regionservers have the same startcode (the last
number listed above)? I'd guess not.
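
If you want to do that comparison mechanically, here is a rough sketch.
It assumes each server name is "host,port,startcode" as in the hbck
output above. The function names (parse_server_name, stale_dead_entries)
are made up for illustration and are not part of HBase; you would paste
in the live server names yourself from the master web UI.

```python
from collections import namedtuple

# Hypothetical helper type; mirrors the "host,port,startcode" layout above.
ServerName = namedtuple("ServerName", ["host", "port", "startcode"])


def parse_server_name(text):
    """Split 'host,port,startcode' into its three parts."""
    host, port, startcode = text.strip().split(",")
    return ServerName(host, int(port), int(startcode))


def stale_dead_entries(dead, live):
    """Return dead entries whose host:port is online again under a
    different startcode, i.e. leftovers from an earlier instance."""
    live_codes = {}
    for name in map(parse_server_name, live):
        live_codes.setdefault((name.host, name.port), set()).add(name.startcode)

    stale = []
    for name in map(parse_server_name, dead):
        codes = live_codes.get((name.host, name.port))
        # Same host and port are live, but not with this startcode.
        if codes and name.startcode not in codes:
            stale.append(name)
    return stale
```

Anything it returns is an old instance of a server that has since
restarted, which is what I'd expect most of your 12 entries to be.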
St.Ack