Given the log snippet, I'd guess it's because your hbase doesn't have HBASE-2643.
That fix makes it so we continue through an EOF exception when
splitting logs, where before we'd fail the splitting, requeue, split,
then fail again.
Here is a comment recently added to our little hbase book at src/docbkx/book.xml:
<section>
<title>How EOFExceptions are treated when splitting a crashed
RegionServer's WALs</title>
<para>If we get an EOF while splitting logs, we proceed with the split
even when <varname>hbase.hlog.split.skip.errors</varname> ==
<constant>false</constant>. An EOF while reading the last log in the
set of files to split is near-guaranteed since the RegionServer likely
crashed mid-write of a record. But we'll continue even if we got an
EOF reading other than the last file in the set.<footnote>
<para>For background, see <link
xlink:href="https://issues.apache.org/jira/browse/HBASE-2643">HBASE-2643
Figure how to deal with eof splitting logs</link></para>
</footnote></para>
</section>
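To make the behavior concrete, here is a rough sketch in plain Java. It is not the actual HLog splitting code: the names SplitSketch, splitLogs, encode and truncate are made up for illustration, records are just UTF strings in in-memory byte arrays, and the skipErrors flag only stands in for hbase.hlog.split.skip.errors. What it tries to show is that an EOF ends the current file and the split moves on to the next one, whatever the flag says, while other IOExceptions still respect the flag.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in HBase.
class SplitSketch {

    // Reads every record from every "log". skipErrors stands in for
    // hbase.hlog.split.skip.errors and is only consulted for non-EOF errors.
    static List<String> splitLogs(List<byte[]> logs, boolean skipErrors) throws IOException {
        List<String> recovered = new ArrayList<>();
        for (byte[] log : logs) {
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(log))) {
                while (in.available() > 0) {
                    // readUTF throws EOFException if the record is cut short
                    recovered.add(in.readUTF());
                }
            } catch (EOFException eof) {
                // Truncated record: keep what was read and continue with the
                // next file, even when skipErrors is false.
            } catch (IOException e) {
                // Any other read error fails the split unless errors are skipped.
                if (!skipErrors) {
                    throw e;
                }
            }
        }
        return recovered;
    }

    // Serializes records the way the reader above expects them.
    static byte[] encode(String... records) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (String record : records) {
                out.writeUTF(record);
            }
        }
        return bytes.toByteArray();
    }

    // Drops the last `drop` bytes to mimic a server that died mid-write.
    static byte[] truncate(byte[] data, int drop) {
        return Arrays.copyOf(data, data.length - drop);
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> logs = Arrays.asList(
            truncate(encode("a", "b"), 1), // not the last file, yet it ends in an EOF
            encode("c"));
        System.out.println(splitLogs(logs, false)); // prints [a, c]
    }
}
```

Record "b" is lost because its write never completed, but the split itself finishes rather than being requeued over and over.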
St.Ack
On Tue, Sep 21, 2010 at 3:00 PM, Jack Levin <[email protected]> wrote:
> First, I saw:
>
>
> 2010-09-21 11:30:05,122 DEBUG
> org.apache.hadoop.hbase.master.RegionServerOperationQueue: Put
> ProcessServerShutdown of 10.103.2.5,60020,1285042335711 back on queue
> 2010-09-21 11:30:05,122 DEBUG
> org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing
> todo: ProcessServerShutdown of 10.103.2.5,60020,1285042335711
> 2010-09-21 11:30:05,122 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation: Process shutdown
> of server 10.103.2.5,60020,1285042335711: logSplit: false,
> rootRescanned: false, numberOfMetaRegions: 1,
> onlineMetaRegions.size(): 0
>
> repeated rapidly for 20 mins or so.
>
> Then:
>
> Bunch of regions got unassigned:
>
>
> 2010-09-21 12:00:07,782 DEBUG
> org.apache.hadoop.hbase.master.RegionManager: Unassigning 66 regions
> from 10.103.2.3,60020,1285042333293
> 2010-09-21 12:00:07,782 DEBUG
> org.apache.hadoop.hbase.master.RegionManager: Going to close region
> img816,img2103r.jpg,1285003791610.1592893332
> 2010-09-21 12:00:07,782 DEBUG
> org.apache.hadoop.hbase.master.RegionManager: Going to close region
> img534,92166039.jpg,1284949117852.1009352950
> 2010-09-21 12:00:07,782 DEBUG
> org.apache.hadoop.hbase.master.RegionManager: Going to close region
> img36,abcwu.jpg,1285001278990.272235177
>
>
> Restarting the master did not help. Ultimately, what brought the cluster
> back up was a full shutdown of the regionservers and masters, followed by
> bringing everything back up.
>
> Any ideas what might have happened here?
>
> We are running:
>
> HBase Version 0.89.20100726, r979826
> Hadoop Version 0.20.2+320, r9b72d268a0b590b4fd7d13aca17c1c453f8bc957
> Regions On FS 5057
>
> 3 zookeepers and 13 regionservers.
>
> -Jack
>