Raghu, are you running on an hdfs which has append? St.Ack
On Fri, May 13, 2011 at 12:07 AM, Raghu Angadi <[email protected]> wrote: > This happened twice in two days: > > clients can't access one of the regions and consistently fails > with NotServingRegionException > "org.apache.hadoop.hbase.NotServingRegionException: > Region is not online: > users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f." ... > > restarting the region server fixes this. > > portions from RS and master are below. Please let me know if more logs are > needed. we will continue to look into this more. > > *on RegionServer* : > > seems to happen right after compaction. grep of '6136400' on region > server smf1-afz-19-sr1 : > [...] > 011-05-12 12:05:08,125 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open > region: users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f. > 2011-05-12 12:05:08,126 DEBUG > org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing > open of users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f. > 2011-05-12 12:05:08,127 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: > regionserver:60020-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62 > Retrieved 114 byte(s) of data from znode > /twitter/hbase/unassigned/a0bf035ac417cdd0697464f1c48f387f; > data=region=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f., > server=smf1-aen-35-sr1.prod.twitter.com:60000, state=M_ZK_REGION_OFFLINE > 2011-05-12 12:05:08,129 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: > Opening region: REGION => {NAME => > 'users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.', STARTKEY > => '61364002', ENDKEY => '613930521', ENCODED => > a0bf035ac417cdd0697464f1c48f387f, TABLE => {{NAME => 'users', MAX_FILESIZE > => '130023424', FAMILIES => [{NAME => 'columns', BLOOMFILTER => 'NONE', > REPLICATION_SCOPE => '0', VERSIONS => '2', COMPRESSION => 'LZO', TTL => > '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => > 'true'}, {NAME => 'extracted', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => > '0', VERSIONS => '3', COMPRESSION => 'LZO', TTL => '2147483647', BLOCKSIZE > => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => > 'protobuf', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => > '2', COMPRESSION => 'LZO', TTL => '2147483647', BLOCKSIZE => '65536', > IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'time', BLOOMFILTER => > 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '2147483647', COMPRESSION => > 'LZO', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', > BLOCKCACHE => 'true'}]}} > 2011-05-12 12:05:08,130 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: > Instantiated users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f. > 2011-05-12 12:05:20,977 INFO org.apache.hadoop.hbase.regionserver.HRegion: > Onlined users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.; next > sequenceid=208908217 > 2011-05-12 12:05:20,979 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: > regionserver:60020-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62 > Retrieved 128 byte(s) of data from znode > /twitter/hbase/unassigned/a0bf035ac417cdd0697464f1c48f387f; > data=region=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f., > server=smf1-afz-19-sr1.prod.twitter.com,60020,1302734205220, > state=RS_ZK_REGION_OPENING > 2011-05-12 12:05:20,982 DEBUG > org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction > requested for users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f. > because Region has too many store files; priority=21, compaction queue > size=0 > 2011-05-12 12:05:20,982 INFO org.apache.hadoop.hbase.regionserver.HRegion: > Starting compaction on region > users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f. > 2011-05-12 12:05:20,984 INFO org.apache.hadoop.hbase.catalog.MetaEditor: > Updated row users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f. > in region .META.,,1 with server=smf1-afz-19-sr1.prod.twitter.com:60020, > startcode=1302734205220 > 2011-05-12 12:05:20,984 INFO org.apache.hadoop.hbase.regionserver.HRegion: > Onlined users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.; next > sequenceid=208908217 > 2011-05-12 12:05:20,985 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: > regionserver:60020-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62 > Retrieved 128 byte(s) of data from znode > /twitter/hbase/unassigned/a0bf035ac417cdd0697464f1c48f387f; > data=region=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f., > server=smf1-afz-19-sr1.prod.twitter.com,60020,1302734205220, > state=RS_ZK_REGION_OPENING > 2011-05-12 12:05:20,986 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: > regionserver:60020-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62-0x42daa1daae7ca62 > Retrieved 128 byte(s) of data from znode > /twitter/hbase/unassigned/a0bf035ac417cdd0697464f1c48f387f; > data=region=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f., > server=smf1-afz-19-sr1.prod.twitter.com,60020,1302734205220, > state=RS_ZK_REGION_OPENING > 2011-05-12 12:05:20,986 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: > Closing users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f.: > disabling compactions & flushes > 2011-05-12 12:05:20,986 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: > Updates disabled for region > users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f. > 2011-05-12 12:05:20,986 INFO org.apache.hadoop.hbase.regionserver.HRegion: > Closed users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f. > 2011-05-12 12:05:20,987 DEBUG > org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Opened > users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f. > 2011-05-12 12:05:23,392 INFO org.apache.hadoop.hbase.regionserver.HRegion: > completed compaction on region > users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f. after 2sec > 2011-05-12 12:40:44,829 DEBUG > org.apache.hadoop.hbase.regionserver.HRegionServer: > NotServingRegionException; Region is not online: > users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f. > 2011-05-12 12:40:45,862 DEBUG > org.apache.hadoop.hbase.regionserver.HRegionServer: > NotServingRegionException; Region is not online: > users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f. > > [...] > > *on Master* : from grep '61364002,1297594642368' : > [...] > 2011-05-12 12:03:54,786 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; > was=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f. > state=CLOSED, ts=1305201834784 > 2011-05-12 12:03:54,790 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: > master:60000-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6 > Retrieved 114 byte(s) of data from znode > /twitter/hbase/unassigned/a0bf035ac417cdd0697464f1c48f387f and set watcher; > region=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f., > server=smf1-aen-35-sr1.prod.twitter.com:60000, state=M_ZK_REGION_OFFLINE > [...] > 2011-05-12 12:04:31,813 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Assigning region > users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f. to > smf1-afz-19-sr1.prod.twitter.com,60020,1302734205220 > 2011-05-12 12:04:39,333 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: > master:60000-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6 > Retrieved 128 byte(s) of data from znode > /twitter/hbase/unassigned/a0bf035ac417cdd0697464f1c48f387f and set watcher; > region=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f., > server=smf1-afz-19-sr1.prod.twitter.com,60020,1302734205220, > state=RS_ZK_REGION_OPENING > [...] > 2011-05-12 12:05:08,122 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition > timed out: users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f. > state=OPENING, ts=1305201871850 > 2011-05-12 12:05:08,122 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPENING > for too long, reassigning > region=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f. > 2011-05-12 12:05:08,122 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: > master:60000-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6 > Retrieved 128 byte(s) of data from znode > /twitter/hbase/unassigned/a0bf035ac417cdd0697464f1c48f387f; > data=region=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f., > server=smf1-afz-19-sr1.prod.twitter.com,60020,1302734205220, > state=RS_ZK_REGION_OPENING > 2011-05-12 12:05:08,124 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: > master:60000-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6-0x12daaaef8382ac6 > Retrieved 114 byte(s) of data from znode > /twitter/hbase/unassigned/a0bf035ac417cdd0697464f1c48f387f and set watcher; > region=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f., > server=smf1-aen-35-sr1.prod.twitter.com:60000, state=M_ZK_REGION_OFFLINE > 2011-05-12 12:05:08,124 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Successfully transitioned > region=users,61364002,1297594642368.a0bf035ac417cdd0697464f1c48f387f. into > OFFLINE and forcing a new assignment > > Thanks for looking into this. > > Raghu. >
