Re: all regions unregistered over time.

Stack Tue, 21 Sep 2010 22:09:59 -0700

Given the log snippet, I'd guess it's because your HBase doesn't have HBASE-2643.


HBASE-2643 makes it so we continue through an EOF exception when
splitting logs. Before it, we'd fail the split, requeue it, split
again, then fail again, over and over.

Here is a comment recently added to our little hbase book at src/docbkx/book.xml:

      <section>
        <title>How EOFExceptions are treated when splitting a crashed
        RegionServers' WALs</title>

        <para>If we get an EOF while splitting logs, we proceed with the split
        even when <varname>hbase.hlog.split.skip.errors</varname> ==
        <constant>false</constant>. An EOF while reading the last log in the
        set of files to split is near-guaranteed since the RegionServer likely
        crashed mid-write of a record. But we'll continue even if we got an
        EOF reading other than the last file in the set.<footnote>
            <para>For background, see <link
            xlink:href="https://issues.apache.org/jira/browse/HBASE-2643">HBASE-2643
            Figure how to deal with eof splitting logs</link></para>
          </footnote></para>
      </section>
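
To make the rule in that paragraph concrete, here is a rough sketch of the control flow. It is not the actual HBase splitting code: the class SplitEofSketch, the LogReader interface and the readerOf helper are all made up for illustration, and the skipErrors flag only stands in for hbase.hlog.split.skip.errors. What happens on a non-EOF error when the flag is true (drop the rest of that file and carry on) is my assumption, not something stated above.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitEofSketch {

    /** Hypothetical stand-in for a WAL reader; not an HBase class. */
    public interface LogReader {
        /** Returns the next edit, or null at a clean end of file. */
        String next() throws IOException;
    }

    /**
     * Reads every log in order and collects the edits that could be read.
     * skipErrors plays the role of hbase.hlog.split.skip.errors.
     */
    public static List<String> splitLogs(List<LogReader> logs, boolean skipErrors)
            throws IOException {
        List<String> recovered = new ArrayList<String>();
        for (LogReader log : logs) {
            try {
                String edit;
                while ((edit = log.next()) != null) {
                    recovered.add(edit);
                }
            } catch (EOFException e) {
                // Truncated record: keep what was read and move on to the
                // next file, whatever skipErrors says and whichever file it is.
            } catch (IOException e) {
                if (!skipErrors) {
                    throw e; // any other error still fails the split
                }
                // Assumption: with skipErrors == true, drop the rest of this
                // file and continue with the next one.
            }
        }
        return recovered;
    }

    /** Test helper: yields the given edits, then throws failureAtEnd if non-null. */
    public static LogReader readerOf(final IOException failureAtEnd, final String... edits) {
        return new LogReader() {
            private int i = 0;

            public String next() throws IOException {
                if (i < edits.length) {
                    return edits[i++];
                }
                if (failureAtEnd != null) {
                    throw failureAtEnd;
                }
                return null;
            }
        };
    }

    public static void main(String[] args) throws IOException {
        // The first file is truncated (EOF) even though it is not the last one.
        List<LogReader> logs = Arrays.asList(
                readerOf(new EOFException("truncated"), "a1"),
                readerOf(null, "b1", "b2"));
        System.out.println(splitLogs(logs, false)); // prints [a1, b1, b2]
    }
}
```

The point of the sketch is only the ordering of the two catch blocks: EOFException is handled first and never fails the split, so the requeue/fail loop cannot be triggered by a truncated log.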

St.Ack

On Tue, Sep 21, 2010 at 3:00 PM, Jack Levin <[email protected]> wrote:
> First, I saw:
>
>
> 2010-09-21 11:30:05,122 DEBUG
> org.apache.hadoop.hbase.master.RegionServerOperationQueue: Put
> ProcessServerShutdown of 10.103.2.5,60020,1285042335711 back on queue
> 2010-09-21 11:30:05,122 DEBUG
> org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing
> todo: ProcessServerShutdown of 10.103.2.5,60020,1285042335711
> 2010-09-21 11:30:05,122 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation: Process shutdown
> of server 10.103.2.5,60020,1285042335711: logSplit: false,
> rootRescanned: false, numberOfMetaRegions: 1,
> onlineMetaRegions.size(): 0
>
> repeated rapidly for 20 mins or so.
>
> Then:
>
> Bunch of regions got unassigned:
>
>
> 2010-09-21 12:00:07,782 DEBUG
> org.apache.hadoop.hbase.master.RegionManager: Unassigning 66 regions
> from 10.103.2.3,60020,1285042333293
> 2010-09-21 12:00:07,782 DEBUG
> org.apache.hadoop.hbase.master.RegionManager: Going to close region
> img816,img2103r.jpg,1285003791610.1592893332
> 2010-09-21 12:00:07,782 DEBUG
> org.apache.hadoop.hbase.master.RegionManager: Going to close region
> img534,92166039.jpg,1284949117852.1009352950
> 2010-09-21 12:00:07,782 DEBUG
> org.apache.hadoop.hbase.master.RegionManager: Going to close region
> img36,abcwu.jpg,1285001278990.272235177
>
>
> Restarting master did not help.  Ultimately what brought the cluster
> back up, is full shutdown of regionservers, and masters, and then
> bring all up.
>
> Any ideas what might have happened here?
>
> We are running:
>
> HBase Version   0.89.20100726, r979826
> Hadoop Version  0.20.2+320, r9b72d268a0b590b4fd7d13aca17c1c453f8bc957
> Regions On FS   5057
>
> 3 zookeepers and 13 regionservers.
>
> -Jack
>
