Hi Ted,

Could you please take a look at the log I've pasted above? Thanks!

On Thu, May 5, 2016 at 11:02 AM, Shuai Lin <[email protected]> wrote:
> Hi Ted,
>
> The hbase version is 1.0.0-cdh5.4.8, shipped with Cloudera CDH 5.4.8. The
> RS logs on node6 can be found here:
> http://paste.openstack.org/raw/496174/
>
> Thanks!
>
> Shuai
>
> On Thu, May 5, 2016 at 9:15 AM, Ted Yu <[email protected]> wrote:
>
>> Can you pastebin the related server log w.r.t.
>> d1c7f3f455f2529da82a2f713b5ee067 from rs-node6?
>>
>> Which release of hbase are you using?
>>
>> Cheers
>>
>> On Wed, May 4, 2016 at 6:07 PM, Shuai Lin <[email protected]> wrote:
>>
>>> Hi list,
>>>
>>> Last weekend one of our region servers crashed, but some of its regions
>>> never got online again on the other RSes. I've gone through the logs;
>>> here is a timeline of the relevant events:
>>>
>>> * 13:03:50 one of the region servers, rs-node7, died because of a disk
>>> failure. The master started to split rs-node7's WALs:
>>>
>>> 2016-04-30 13:03:50,953 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for rs-node7.example.com,60020,1458724695128 before assignment; region count=133
>>> 2016-04-30 13:03:50,966 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of logs to split
>>> 2016-04-30 13:03:50,966 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting 33 logs in [hdfs://nameservice1/hbase/WALs/rs-node7.example.com,60020,1458724695128-splitting] for [rs-node7.example.com,60020,1458724695128]
>>>
>>> * 13:10:47 the WAL splits were done, and the master began to re-assign
>>> the regions:
>>>
>>> 2016-04-30 13:10:47,655 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 133 region(s) that rs-node7.example.com,60020,1458724695128 was carrying (and 0 regions(s) that were opening on this server)
>>> 2016-04-30 13:10:47,665 INFO org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 133 region(s) across 6 server(s), round-robin=true
>>> 2016-04-30 13:10:47,667 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to rs-node1.example.com,60020,1458720625688
>>> 2016-04-30 13:10:47,667 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to rs-node2.example.com,60020,1458721110988
>>> 2016-04-30 13:10:47,667 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to rs-node3.example.com,60020,1458721713906
>>> 2016-04-30 13:10:47,679 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 23 region(s) to rs-node4.example.com,60020,1458722335527
>>> 2016-04-30 13:10:47,691 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to rs-node5.example.com,60020,1458722992296
>>> 2016-04-30 13:10:47,702 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883
>>>
>>> * 13:11:47 the open-region RPCs sent by the master to the region servers
>>> timed out after 1 minute:
>>>
>>> 2016-04-30 13:11:47,780 INFO org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate with rs-node3.example.com,60020,1458721713906 in order to assign regions
>>> java.io.IOException: Call to rs-node3.example.com/172.16.6.3:60020 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=4, waitTime=60001, operationTimeout=60000 expired.
>>> 2016-04-30 13:11:47,782 INFO org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22 regions to server rs-node6.example.com,60020,1458723856883, reassigning them
>>> 2016-04-30 13:11:47,782 INFO org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 23 regions to server rs-node4.example.com,60020,1458722335527, reassigning them
>>> 2016-04-30 13:11:47,782 INFO org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22 regions to server rs-node3.example.com,60020,1458721713906, reassigning them
>>> 2016-04-30 13:11:47,783 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Force region state offline {a65660e421f114e93862194f7cc35644 state=OPENING, ts=1462021907753, server=rs-node6.example.com,60020,1458723856883}
>>>
>>> * After that, part of the regions (40 out of 130 regions) never got
>>> online, and the following lines were logged repeatedly in the master log:
>>>
>>> 2016-04-30 13:12:37,188 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: update {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN, ts=1462021957087, server=rs-node6.example.com,60020,1458723856883} the timestamp.
>>> 2016-04-30 13:12:37,188 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Region is already in transition; waiting up to 10668ms
>>>
>>> $ grep 'AssignmentManager: update {d1c7f3f455f2529da82a2f713b5ee067' /var/log/hbase/hbase-cmf-hbase-MASTER-head.example.com.log.out.1 | wc -l
>>> 484
>>>
>>> I've searched the mailing list archive and the HBase JIRA, but didn't
>>> find any similar situation. The closest one is HBASE-14407
>>> (https://issues.apache.org/jira/browse/HBASE-14407), but after reading
>>> its discussion I don't think it's the same problem.
>>>
>>> Does anyone have a clue? Thanks!
>>>
>>> Regards,
>>> Shuai
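P.S. In case it helps with triage: the grep in the quoted thread counts the
"update {...}" lines for one region id at a time. A rough Python sketch along
these lines (just a quick helper, not HBase tooling; the regex only assumes
the master log format quoted above) tallies the refresh count for every
region stuck in transition in one pass:

```python
import re
from collections import Counter

# Matches AssignmentManager "update {<region hash> state=<STATE>, ...}"
# lines, as seen in the master log snippets quoted above.
UPDATE_RE = re.compile(r"AssignmentManager: update \{([0-9a-f]+) state=(\w+),")

def count_stuck_regions(lines):
    """Return (per-region update counts, last seen state per region).

    A region that shows up hundreds of times (like the 484 grep hits for
    d1c7f3f455f2529da82a2f713b5ee067) is stuck in transition.
    """
    counts = Counter()
    states = {}
    for line in lines:
        m = UPDATE_RE.search(line)
        if m:
            region, state = m.groups()
            counts[region] += 1
            states[region] = state
    return counts, states

if __name__ == "__main__":
    # Path is the one from the grep command quoted above.
    log = "/var/log/hbase/hbase-cmf-hbase-MASTER-head.example.com.log.out.1"
    with open(log) as f:
        counts, states = count_stuck_regions(f)
    for region, n in counts.most_common():
        print(region, states[region], n)
```

Run over the master log, this should list the stuck regions (most-refreshed
first) together with the state they are wedged in, e.g. PENDING_OPEN.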
