Here is the complete log on node6 between 13:10:47 and 13:11:47: http://paste.openstack.org/raw/505826/

The master asked node6 to open several regions. Node6 opened the first 4 very fast (within 1 second) and got stuck at the 5th one. But there are no errors at that time.
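For anyone who wants to reproduce that check against the raw log, something like this should work. Note the RS log filename below is only my guess, modeled on the master log path quoted further down; adjust for your install:

  # RS log filename guessed from the master log's naming scheme
  $ LOG=/var/log/hbase/hbase-cmf-hbase-REGIONSERVER-rs-node6.example.com.log.out
  $ for r in d1c7f3f455f2529da82a2f713b5ee067 a65660e421f114e93862194f7cc35644; do
  >   grep -m1 "$r" "$LOG"   # first mention = when node6 started opening this region
  > done

The timestamp of each region's first mention shows when node6 started handling its open, so the gap between regions 4 and 5 stands out immediately.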
On Wed, May 25, 2016 at 10:12 PM, Ted Yu <[email protected]> wrote:

> In AssignmentManager#assign(), you should find:
>
>     // Send OPEN RPC. If it fails on a IOE or RemoteException,
>     // regions will be assigned individually.
>     long maxWaitTime = System.currentTimeMillis() +
>         this.server.getConfiguration().
>             getLong("hbase.regionserver.rpc.startup.waittime", 60000);
>
> BTW can you see what caused rs-node6 to not respond around 13:11:47?
>
> Cheers
>
> On Fri, May 20, 2016 at 6:20 AM, Shuai Lin <[email protected]> wrote:
>
>> Because the "opening regions" RPC call sent by the master to the region
>> server node6 timed out after 1 minute?
>>
>> *RPC call was sent:*
>>
>> 2016-04-30 13:10:47,702 INFO org.apache.hadoop.hbase.master.AssignmentManager:
>> Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883
>>
>> *After 1 minute:*
>>
>> 2016-04-30 13:11:47,780 INFO org.apache.hadoop.hbase.master.AssignmentManager:
>> Unable to communicate with rs-node6.example.com,60020,1458723856883 in order
>> to assign regions, java.io.IOException: Call to
>> rs-node6.example.com/172.16.6.6:60020 failed on local exception:
>> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=4, waitTime=60001,
>> operationTimeout=60000 expired.
>>
>> 2016-04-30 13:11:47,783 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
>> Force region state offline {d1c7f3f455f2529da82a2f713b5ee067
>> state=PENDING_OPEN, ts=1462021847743,
>> server=rs-node6.example.com,60020,1458723856883}
>>
>> I have checked the hbase source code, but didn't find any specific timeout
>> setting for the "open region" RPC call that I could use. So I guess it's
>> using the default "hbase.rpc.timeout", which defaults to 60 secs. And since
>> there are 20+ regions being assigned to node6 at almost the same moment,
>> node6 gets overloaded and can't finish opening all of them within one
>> minute.
>>
>> So this looks like an hbase bug to me (regions never get online when the
>> region server fails to handle the OpenRegionRequest before the RPC
>> timeout), am I right?
>>
>> On Fri, May 20, 2016 at 12:42 PM, Ted Yu <[email protected]> wrote:
>>
>>> Looks like region d1c7f3f455f2529da82a2f713b5ee067 received a CLOSE
>>> request while it was opening, leading to RegionAlreadyInTransitionException.
>>>
>>> Was there any clue in the master log about why the close request was
>>> sent?
>>>
>>> Cheers
>>>
>>> On Wed, May 4, 2016 at 8:02 PM, Shuai Lin <[email protected]> wrote:
>>>
>>>> Hi Ted,
>>>>
>>>> The hbase version is 1.0.0-cdh5.4.8, shipped with Cloudera CDH 5.4.8.
>>>> The RS logs on node6 can be found here:
>>>> http://paste.openstack.org/raw/496174/
>>>>
>>>> Thanks!
>>>>
>>>> Shuai
>>>>
>>>> On Thu, May 5, 2016 at 9:15 AM, Ted Yu <[email protected]> wrote:
>>>>
>>>>> Can you pastebin the related server log w.r.t.
>>>>> d1c7f3f455f2529da82a2f713b5ee067 from rs-node6?
>>>>>
>>>>> Which release of hbase are you using?
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Wed, May 4, 2016 at 6:07 PM, Shuai Lin <[email protected]> wrote:
>>>>>
>>>>>> Hi list,
>>>>>>
>>>>>> Last weekend a region server crashed, but some of its regions never
>>>>>> got online again on other RSes.
>>>>>> I've gone through the logs, and here is the timeline of some of the
>>>>>> events:
>>>>>>
>>>>>> * 13:03:50 one of the region servers, rs-node7, died because of a
>>>>>> disk failure. The master started to split rs-node7's WALs:
>>>>>>
>>>>>> 2016-04-30 13:03:50,953 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
>>>>>> Splitting logs for rs-node7.example.com,60020,1458724695128 before
>>>>>> assignment; region count=133
>>>>>> 2016-04-30 13:03:50,966 DEBUG org.apache.hadoop.hbase.master.SplitLogManager:
>>>>>> Scheduling batch of logs to split
>>>>>> 2016-04-30 13:03:50,966 INFO org.apache.hadoop.hbase.master.SplitLogManager:
>>>>>> started splitting 33 logs in
>>>>>> [hdfs://nameservice1/hbase/WALs/rs-node7.example.com,60020,1458724695128-splitting]
>>>>>> for [rs-node7.example.com,60020,1458724695128]
>>>>>>
>>>>>> * 13:10:47 WAL splits done, the master began to re-assign regions:
>>>>>>
>>>>>> 2016-04-30 13:10:47,655 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
>>>>>> Reassigning 133 region(s) that rs-node7.example.com,60020,1458724695128
>>>>>> was carrying (and 0 regions(s) that were opening on this server)
>>>>>> 2016-04-30 13:10:47,665 INFO org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Bulk assigning 133 region(s) across 6 server(s), round-robin=true
>>>>>> 2016-04-30 13:10:47,667 INFO org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Assigning 22 region(s) to rs-node1.example.com,60020,1458720625688
>>>>>> 2016-04-30 13:10:47,667 INFO org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Assigning 22 region(s) to rs-node2.example.com,60020,1458721110988
>>>>>> 2016-04-30 13:10:47,667 INFO org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Assigning 22 region(s) to rs-node3.example.com,60020,1458721713906
>>>>>> 2016-04-30 13:10:47,679 INFO org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Assigning 23 region(s) to rs-node4.example.com,60020,1458722335527
>>>>>> 2016-04-30 13:10:47,691 INFO org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Assigning 22 region(s) to rs-node5.example.com,60020,1458722992296
>>>>>> 2016-04-30 13:10:47,702 INFO org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883
>>>>>>
>>>>>> * 13:11:47 the "opening regions" RPC sent by the master to the region
>>>>>> servers timed out after 1 minute:
>>>>>>
>>>>>> 2016-04-30 13:11:47,780 INFO org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Unable to communicate with rs-node3.example.com,60020,1458721713906 in
>>>>>> order to assign regions
>>>>>> java.io.IOException: Call to rs-node3.example.com/172.16.6.3:60020
>>>>>> failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException:
>>>>>> Call id=4, waitTime=60001, operationTimeout=60000 expired.
>>>>>> 2016-04-30 13:11:47,782 INFO org.apache.hadoop.hbase.master.GeneralBulkAssigner:
>>>>>> Failed assigning 22 regions to server rs-node6.example.com,60020,1458723856883,
>>>>>> reassigning them
>>>>>> 2016-04-30 13:11:47,782 INFO org.apache.hadoop.hbase.master.GeneralBulkAssigner:
>>>>>> Failed assigning 23 regions to server rs-node4.example.com,60020,1458722335527,
>>>>>> reassigning them
>>>>>> 2016-04-30 13:11:47,782 INFO org.apache.hadoop.hbase.master.GeneralBulkAssigner:
>>>>>> Failed assigning 22 regions to server rs-node3.example.com,60020,1458721713906,
>>>>>> reassigning them
>>>>>> 2016-04-30 13:11:47,783 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Force region state offline {a65660e421f114e93862194f7cc35644
>>>>>> state=OPENING, ts=1462021907753,
>>>>>> server=rs-node6.example.com,60020,1458723856883}
>>>>>>
>>>>>> * After that, part of the regions (40 out of 130) never got online,
>>>>>> and the following lines were logged repeatedly in the master log:
>>>>>>
>>>>>> 2016-04-30 13:12:37,188 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> update {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
>>>>>> ts=1462021957087, server=rs-node6.example.com,60020,1458723856883}
>>>>>> the timestamp.
>>>>>> 2016-04-30 13:12:37,188 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Region is already in transition; waiting up to 10668ms
>>>>>>
>>>>>> $ grep 'AssignmentManager: update {d1c7f3f455f2529da82a2f713b5ee067' \
>>>>>>     /var/log/hbase/hbase-cmf-hbase-MASTER-head.example.com.log.out.1 | wc -l
>>>>>> 484
>>>>>>
>>>>>> I've searched the mailing list archives and the hbase JIRA, but didn't
>>>>>> find any similar situation. The most similar one is HBASE-14407
>>>>>> <https://issues.apache.org/jira/browse/HBASE-14407>, but after reading
>>>>>> its discussion I don't think it's the same problem.
>>>>>>
>>>>>> Anyone have a clue? Thanks!
>>>>>>
>>>>>> Regards,
>>>>>> Shuai
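PS: as a stopgap while the root cause of node6's stall is unclear, one option I'm considering is raising both timeouts discussed above: "hbase.rpc.timeout" (the generic RPC timeout the open-region call apparently falls back to) and "hbase.regionserver.rpc.startup.waittime" (the master-side wait in the snippet Ted quoted). A sketch of the hbase-site.xml change; the 3-minute values are a guess for our cluster, not a tested recommendation:

  <property>
    <name>hbase.rpc.timeout</name>
    <!-- default 60000 (60s); note this affects all client RPCs, not just region opens -->
    <value>180000</value>
  </property>
  <property>
    <name>hbase.regionserver.rpc.startup.waittime</name>
    <!-- default 60000; keep it >= hbase.rpc.timeout so the master waits long enough -->
    <value>180000</value>
  </property>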
