double assignment WAS: Errors after major compaction

Ted Yu Thu, 07 Jul 2011 03:42:15 -0700

>> Mind pastebin'ing this part of master log?

2011-06-29 16:39:54,326 DEBUG
org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region
gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
on hadoop1-s05.farm-ny.gigya.com,60020,1307349217076
2011-06-29 16:40:00,598 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
master:60000-0x13004a31d7804c4 Creating (or updating) unassigned node for
584dac5cc70d8682f71c4675a843c309 with OFFLINE state


Eran:
Was there more logging between these two lines in the master log?
TimeoutMonitor.chore() should have logged something if it caused region
re-assignment.

Thanks

On Wed, Jul 6, 2011 at 10:52 PM, Stack <[email protected]> wrote:

> On Sun, Jul 3, 2011 at 12:02 PM, Eran Kutner <[email protected]> wrote:
> > 4. Then at 16:40:00 the master log says: master:60000-0x13004a31d7804c4
> > Creating (or updating) unassigned node for
> > 584dac5cc70d8682f71c4675a843c309 with OFFLINE state - why did it decide
> > to take the region offline after learning it was successfully opened?
>
>
> My guess is that though we'd opened the region, the timeout of regions
> in transition expired and we queued assigning it elsewhere (the
> first step in assigning a region elsewhere is putting the region's
> znode into the OFFLINE state).  Mind pastebin'ing this part of master
> log?
>
> The issues Ted cites and the fix-raciness issue I added to it are
> about cutting down the span over which locks are held in the master --
> this has made for big improvements in the promptness with which the
> master processes state transitions -- and then there are races between
> the handling of region transitions -- e.g. opens -- down in the region
> transition handlers and the running of the timeout monitor.  These are
> what's being addressed.
>
> > 5. Then it tries to reopen the region on hadoop1-s05, which indicates in
> > its log that the open request failed because the region was already
> > open - why didn't the master use that information to learn that the
> > region was already open?
>
> It looks like we log it as WARN on the regionserver side but do
> nothing else with it.  Here is the message:
>
> 2011-06-29 16:40:01,079 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
> Attempted open of
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> but already online on this server
>
> We notice we already have it opened down in the open region handler
> down in the regionserver.  We've let go of the connection to the
> master at this stage so no way of our flagging the master that we
> already have this region.  What we should do is before we queue it,
> check if we already have it and return the master an
> AlreadyOpenException (I made HBASE-4073 to make sure we don't forget
> about this one -- the root issue needs addressing but thereafter, we
> should never queue the opening of a region we already have opened on
> the regionserver)
>
>
> > 7. Now the master forces the transition of the region to hadoop1-s02 but
> > there is no sign of that on hadoop1-s05 - why doesn't the old RS
> > (hadoop1-s05) detect that it is no longer the master and relinquish
> > control of the region?
> >
> Well, the master doesn't know that s05 has the region open -- that's
> why it gives it to s02 -- and then, there is no channel available to
> s05 to figure out who has what.
>
> St.Ack
>

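For reference, here is a minimal sketch of the check-before-queue idea Stack
describes above for HBASE-4073. It is not HBase code: the class, its fields,
and its method names are hypothetical, and AlreadyOpenException is only
modeled on the name proposed in the thread. The point it illustrates is that
the "already online" check happens while the master's request is still being
answered, so the error can travel back to the master instead of ending up
only as a WARN in the regionserver log.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical stand-in for a regionserver; not the real HBase classes.
public class OpenRegionSketch {

    // Hypothetical exception, named after the one proposed in the thread.
    static class AlreadyOpenException extends RuntimeException {
        AlreadyOpenException(String message) {
            super(message);
        }
    }

    // Encoded names of regions this server currently has open.
    private final Set<String> onlineRegions = new HashSet<>();
    // Open requests accepted from the master but not yet processed.
    private final Queue<String> openQueue = new ArrayDeque<>();

    // Runs while the master's request is still in flight: reject before
    // queueing, so the caller (the master) sees the failure directly.
    public synchronized void requestOpen(String encodedName) {
        if (onlineRegions.contains(encodedName)) {
            throw new AlreadyOpenException(
                encodedName + " is already online on this server");
        }
        openQueue.add(encodedName);
    }

    // Stands in for the asynchronous open handler draining the queue.
    public synchronized void processQueue() {
        String encodedName;
        while ((encodedName = openQueue.poll()) != null) {
            onlineRegions.add(encodedName);
        }
    }

    public synchronized int queued() {
        return openQueue.size();
    }

    public static void main(String[] args) {
        OpenRegionSketch server = new OpenRegionSketch();
        String region = "584dac5cc70d8682f71c4675a843c309";
        server.requestOpen(region);   // first open is queued
        server.processQueue();        // region is now online
        try {
            server.requestOpen(region);   // second open is rejected up front
        } catch (AlreadyOpenException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

As Stack notes, this would only address the symptom; the race between the
timeout monitor and the region transition handlers is the root issue.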