>> Mind pastebin'ing this part of master log?

2011-06-29 16:39:54,326 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. on hadoop1-s05.farm-ny.gigya.com,60020,1307349217076
2011-06-29 16:40:00,598 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x13004a31d7804c4 Creating (or updating) unassigned node for 584dac5cc70d8682f71c4675a843c309 with OFFLINE state

Eran:
Was there more log between the two lines in the master log?
TimeoutMonitor.chore() should have logged something if it caused the region re-assignment.

Thanks

On Wed, Jul 6, 2011 at 10:52 PM, Stack <[email protected]> wrote:

> On Sun, Jul 3, 2011 at 12:02 PM, Eran Kutner <[email protected]> wrote:
> > 4. Then at 16:40:00 the master log says: master:60000-0x13004a31d7804c4
> > Creating (or updating) unassigned node for 584dac5cc70d8682f71c4675a843c309
> > with OFFLINE state - why did it decide to take the region offline after
> > learning it was successfully opened?
>
> My guess is that though we'd opened the region, the timeout of regions
> in transition expired and we queued assigning it elsewhere (the
> first step in assigning a region elsewhere is putting the region's
> znode into the OFFLINE state). Mind pastebin'ing this part of master
> log?
>
> The issues Ted cites and the fix raciness issue I added to it are
> about cutting down the span over which locks are held in the master --
> this has made for big improvements in the promptness with which the
> master processes state transitions -- and then there are races between
> the handling of region transitions -- e.g. opens -- down in the region
> transition handlers and the running of the timeout monitor. These are
> what's being addressed.
>
> > 5. Then it tries to reopen the region on hadoop1-s05, which indicates in its
> > log that the open request failed because the region was already open - why
> > didn't the master use that information to learn that the region was already
> > open?
>
> It looks like we log it as WARN on the regionserver side but do
> nothing else with it. Here is the message:
>
> 2011-06-29 16:40:01,079 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
> Attempted open of
> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
> but already online on this server
>
> We notice we already have it opened down in the open region handler
> in the regionserver. We've let go of the connection to the
> master at this stage, so there is no way of flagging to the master that we
> already have this region. What we should do is, before we queue it,
> check if we already have it and return the master an
> AlreadyOpenException (I made HBASE-4073 to make sure we don't forget
> about this one -- the root issue needs addressing but thereafter, we
> should never queue the opening of a region we already have opened on
> the regionserver).
>
> > 7. Now the master forces the transition of the region to hadoop1-s02 but
> > there is no sign of that on hadoop1-s05 - why doesn't the old RS
> > (hadoop1-s05) detect that it is no longer the master and relinquish
> > control of the region?
>
> Well, the master doesn't know that s05 has the region open -- that's
> why it gives it to s02 -- and then, there is no channel available to
> s05 to figure out who has what.
>
> St.Ack
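
For illustration only, here is a minimal sketch of the check proposed above for HBASE-4073: look at the set of online regions before queueing an open, and fail the request synchronously so the caller (the master) learns the region is already open. This is not the HBase implementation. The class RegionServerSketch, its methods, and this AlreadyOpenException class are all hypothetical stand-ins, and the sketch ignores opens that are still in flight.

```java
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical exception; stands in for the AlreadyOpenException named in the thread.
class AlreadyOpenException extends Exception {
    AlreadyOpenException(String encodedRegionName) {
        super("Region " + encodedRegionName + " is already online on this server");
    }
}

// Hypothetical stand-in for the regionserver side of an open-region request.
class RegionServerSketch {
    // Encoded names of regions this server currently serves.
    private final Set<String> onlineRegions = ConcurrentHashMap.newKeySet();
    // Open requests waiting for a handler to process them asynchronously.
    private final Queue<String> openQueue = new ConcurrentLinkedQueue<>();

    // Record that a region finished opening on this server.
    void markOnline(String encodedRegionName) {
        onlineRegions.add(encodedRegionName);
    }

    // Called while the master is still connected: reject the request up front
    // instead of queueing it and only logging a WARN later in the handler.
    void requestOpen(String encodedRegionName) throws AlreadyOpenException {
        if (onlineRegions.contains(encodedRegionName)) {
            throw new AlreadyOpenException(encodedRegionName);
        }
        openQueue.add(encodedRegionName);
    }

    // Number of open requests currently queued.
    int queuedCount() {
        return openQueue.size();
    }
}
```

With something like this, the second open request for 584dac5cc70d8682f71c4675a843c309 on hadoop1-s05 would come back to the master as an exception rather than being queued and dropped with a WARN.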
