Raghu, are you running on an hdfs which has append?
St.Ack

On Fri, May 13, 2011 at 12:07 AM, Raghu Angadi <[email protected]> wrote:
> This happened twice in two days:
>
> clients can't access one of the regions and consistently fails
> with NotServingRegionException
> "org.apache.hadoop.hbase.NotServingRegionException:
> Region is not online:
> users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f." ...
>
> restarting the region server fixes this.
>
> portions from RS and master are below. Please let me know if more logs are
> needed. we will continue to look into this more.
>
> *on RegionServer* :
>
>    seems to happen right after compaction. grep of  '6136400' on region
> server smf1-afz-19-sr1 :
> [...]
> 011-05-12 12:05:08,125 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open
> region: users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.
> 2011-05-12 12:05:08,126 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing
> open of users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.
> 2011-05-12 12:05:08,127 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil:
> regionserver:60020-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62
> Retrieved 114 byte(s) of data from znode
> /twitter/hbase/unassigned/a0bf035ac417cdd0697464f1c48f387f;
> data=region=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.,
> server=smf1-aen-35-sr1.prod.twitter.com:60000, state=M_ZK_REGION_OFFLINE
> 2011-05-12 12:05:08,129 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> Opening region: REGION => {NAME =>
> 'users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.', STARTKEY
> => '61364002', ENDKEY => '613930521', ENCODED =>
> a0bf035ac417cdd0697464f1c48f387f, TABLE => {{NAME => 'users', MAX_FILESIZE
> => '130023424', FAMILIES => [{NAME => 'columns', BLOOMFILTER => 'NONE',
> REPLICATION_SCOPE => '0', VERSIONS => '2', COMPRESSION => 'LZO', TTL =>
> '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE =>
> 'true'}, {NAME => 'extracted', BLOOMFILTER => 'NONE', REPLICATION_SCOPE =>
> '0', VERSIONS => '3', COMPRESSION => 'LZO', TTL => '2147483647', BLOCKSIZE
> => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME =>
> 'protobuf', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS =>
> '2', COMPRESSION => 'LZO', TTL => '2147483647', BLOCKSIZE => '65536',
> IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'time', BLOOMFILTER =>
> 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '2147483647', COMPRESSION =>
> 'LZO', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
> BLOCKCACHE => 'true'}]}}
> 2011-05-12 12:05:08,130 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> Instantiated users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.
> 2011-05-12 12:05:20,977 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Onlined users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.; next
> sequenceid=208908217
> 2011-05-12 12:05:20,979 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil:
> regionserver:60020-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62
> Retrieved 128 byte(s) of data from znode
> /twitter/hbase/unassigned/a0bf035ac417cdd0697464f1c48f387f;
> data=region=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.,
> server=smf1-afz-19-sr1.prod.twitter.com,60020,1302734205220,
> state=RS_ZK_REGION_OPENING
> 2011-05-12 12:05:20,982 DEBUG
> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
> requested for users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.
> because Region has too many store files; priority=21, compaction queue
> size=0
> 2011-05-12 12:05:20,982 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Starting compaction on region
> users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.
> 2011-05-12 12:05:20,984 INFO org.apache.hadoop.hbase.catalog.MetaEditor:
> Updated row users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.
> in region .META.,,1 with server=smf1-afz-19-sr1.prod.twitter.com:60020,
> startcode=1302734205220
> 2011-05-12 12:05:20,984 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Onlined users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.; next
> sequenceid=208908217
> 2011-05-12 12:05:20,985 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil:
> regionserver:60020-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62
> Retrieved 128 byte(s) of data from znode
> /twitter/hbase/unassigned/a0bf035ac417cdd0697464f1c48f387f;
> data=region=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.,
> server=smf1-afz-19-sr1.prod.twitter.com,60020,1302734205220,
> state=RS_ZK_REGION_OPENING
> 2011-05-12 12:05:20,986 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil:
> regionserver:60020-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62
> Retrieved 128 byte(s) of data from znode
> /twitter/hbase/unassigned/a0bf035ac417cdd0697464f1c48f387f;
> data=region=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.,
> server=smf1-afz-19-sr1.prod.twitter.com,60020,1302734205220,
> state=RS_ZK_REGION_OPENING
> 2011-05-12 12:05:20,986 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> Closing users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.:
> disabling compactions & flushes
> 2011-05-12 12:05:20,986 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> Updates disabled for region
> users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.
> 2011-05-12 12:05:20,986 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Closed users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.
> 2011-05-12 12:05:20,987 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Opened
> users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.
> 2011-05-12 12:05:23,392 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> completed compaction on region
> users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f. after 2sec
> 2011-05-12 12:40:44,829 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> NotServingRegionException; Region is not online:
> users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.
> 2011-05-12 12:40:45,862 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> NotServingRegionException; Region is not online:
> users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.
>
> [...]
>
> *on Master* : from grep '61364002,1297594642368' :
> [...]
> 2011-05-12 12:03:54,786 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE;
> was=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.
> state=CLOSED, ts=1305201834784
> 2011-05-12 12:03:54,790 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil:
> master:60000-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6
> Retrieved 114 byte(s) of data from znode
> /twitter/hbase/unassigned/a0bf035ac417cdd0697464f1c48f387f and set watcher;
> region=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.,
> server=smf1-aen-35-sr1.prod.twitter.com:60000, state=M_ZK_REGION_OFFLINE
> [...]
> 2011-05-12 12:04:31,813 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning region
> users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f. to
> smf1-afz-19-sr1.prod.twitter.com,60020,1302734205220
> 2011-05-12 12:04:39,333 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil:
> master:60000-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6
> Retrieved 128 byte(s) of data from znode
> /twitter/hbase/unassigned/a0bf035ac417cdd0697464f1c48f387f and set watcher;
> region=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.,
> server=smf1-afz-19-sr1.prod.twitter.com,60020,1302734205220,
> state=RS_ZK_REGION_OPENING
> [...]
> 2011-05-12 12:05:08,122 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition
> timed out:  users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.
> state=OPENING, ts=1305201871850
> 2011-05-12 12:05:08,122 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPENING
> for too long, reassigning
> region=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.
> 2011-05-12 12:05:08,122 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil:
> master:60000-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6
> Retrieved 128 byte(s) of data from znode
> /twitter/hbase/unassigned/a0bf035ac417cdd0697464f1c48f387f;
> data=region=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.,
> server=smf1-afz-19-sr1.prod.twitter.com,60020,1302734205220,
> state=RS_ZK_REGION_OPENING
> 2011-05-12 12:05:08,124 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil:
> master:60000-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6
> Retrieved 114 byte(s) of data from znode
> /twitter/hbase/unassigned/a0bf035ac417cdd0697464f1c48f387f and set watcher;
> region=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.,
> server=smf1-aen-35-sr1.prod.twitter.com:60000, state=M_ZK_REGION_OFFLINE
> 2011-05-12 12:05:08,124 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Successfully transitioned
> region=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f. into
> OFFLINE and forcing a new assignment
>
> Thanks for looking into this.
>
> Raghu.
>

Reply via email to