I think this sounds like a bug. Search in HBase JIRA first. If there is no JIRA with the same symptom, consider filing one.
Cheers

On Fri, Jun 3, 2016 at 1:10 AM, Shuai Lin <linshuai2...@gmail.com> wrote:

> Hi Ted,
>
> I'm kind of confused, so is this normal behaviour or a bug in HBase? To
> me it looks like a bug; should I file a JIRA?
>
> Thanks,
>
> Shuai
>
> On Fri, May 27, 2016 at 8:02 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> There were 7 regions the Master tried to close which were opening but
>> not yet served.
>>
>> d1c7f3f455f2529da82a2f713b5ee067 was one of them.
>>
>> On Fri, May 27, 2016 at 12:47 AM, Shuai Lin <linshuai2...@gmail.com>
>> wrote:
>>
>>> Here is the complete log on node6 between 13:10:47 and 13:11:47:
>>> http://paste.openstack.org/raw/505826/
>>>
>>> The master asked node6 to open several regions. Node6 opened the first 4
>>> very fast (within 1 second) and got stuck at the 5th one. But there were
>>> no errors at that time.
>>>
>>> On Wed, May 25, 2016 at 10:12 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> In AssignmentManager#assign(), you should find:
>>>>
>>>> // Send OPEN RPC. If it fails on a IOE or RemoteException,
>>>> // regions will be assigned individually.
>>>> long maxWaitTime = System.currentTimeMillis() +
>>>>     this.server.getConfiguration().
>>>>     getLong("hbase.regionserver.rpc.startup.waittime", 60000);
>>>>
>>>> BTW can you see what caused rs-node6 to not respond around 13:11:47?
>>>>
>>>> Cheers
>>>>
>>>> On Fri, May 20, 2016 at 6:20 AM, Shuai Lin <linshuai2...@gmail.com>
>>>> wrote:
>>>>
>>>>> Because the "opening regions" RPC call sent by the master to the
>>>>> region server node6 timed out after 1 minute?
>>>>>
>>>>> *RPC call was sent:*
>>>>>
>>>>> 2016-04-30 13:10:47,702 INFO
>>>>> org.apache.hadoop.hbase.master.AssignmentManager:
>>>>> Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883
>>>>>
>>>>> *After 1 minute:*
>>>>>
>>>>> 2016-04-30 13:11:47,780 INFO
>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
>>>>> with rs-node6.example.com,60020,1458723856883 in order to assign
>>>>> regions, java.io.IOException: Call to
>>>>> rs-node6.example.com/172.16.6.6:60020 failed on local exception:
>>>>> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=4,
>>>>> waitTime=60001, operationTimeout=60000 expired.
>>>>>
>>>>> 2016-04-30 13:11:47,783 DEBUG
>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Force region state
>>>>> offline {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
>>>>> ts=1462021847743, server=rs-node6.example.com,60020,1458723856883}
>>>>>
>>>>> I have checked the HBase source code, but don't find any specific
>>>>> timeout setting for the "open region" RPC call. So I guess it's using
>>>>> the default "hbase.rpc.timeout", which defaults to 60 secs. And since
>>>>> there are 20+ regions being assigned to node6 at almost the same
>>>>> moment, node6 gets overloaded and can't finish opening all of them
>>>>> within one minute.
>>>>>
>>>>> So this looks like an HBase bug to me (regions never come online when
>>>>> the region server fails to handle the OpenRegionRequest before the RPC
>>>>> timeout), am I right?
>>>>>
>>>>> On Fri, May 20, 2016 at 12:42 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>
>>>>>> Looks like region d1c7f3f455f2529da82a2f713b5ee067 received a CLOSE
>>>>>> request when it was opening, leading to
>>>>>> RegionAlreadyInTransitionException.
>>>>>>
>>>>>> Was there any clue in the master log why the close request was sent?
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On Wed, May 4, 2016 at 8:02 PM, Shuai Lin <linshuai2...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> > Hi Ted,
>>>>>> >
>>>>>> > The HBase version is 1.0.0-cdh5.4.8, shipped with Cloudera CDH
>>>>>> > 5.4.8. The RS logs on node6 can be found here:
>>>>>> > http://paste.openstack.org/raw/496174/
>>>>>> >
>>>>>> > Thanks!
>>>>>> >
>>>>>> > Shuai
>>>>>> >
>>>>>> > On Thu, May 5, 2016 at 9:15 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>> >
>>>>>> > > Can you pastebin the related server log w.r.t.
>>>>>> > > d1c7f3f455f2529da82a2f713b5ee067 from rs-node6?
>>>>>> > >
>>>>>> > > Which release of HBase are you using?
>>>>>> > >
>>>>>> > > Cheers
>>>>>> > >
>>>>>> > > On Wed, May 4, 2016 at 6:07 PM, Shuai Lin <linshuai2...@gmail.com>
>>>>>> > > wrote:
>>>>>> > >
>>>>>> > > > Hi list,
>>>>>> > > >
>>>>>> > > > Last weekend one of my region servers crashed, but some regions
>>>>>> > > > never came online again on other RSes. I've gone through the
>>>>>> > > > logs, and here is the timeline of some of the events:
>>>>>> > > >
>>>>>> > > > * 13:03:50 one of the region servers, rs-node7, died because of
>>>>>> > > > a disk failure.
>>>>>> > > > Master started to split rs-node7's WALs.
>>>>>> > > >
>>>>>> > > > 2016-04-30 13:03:50,953 INFO
>>>>>> > > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
>>>>>> > > > Splitting logs for rs-node7.example.com,60020,1458724695128
>>>>>> > > > before assignment; region count=133
>>>>>> > > > 2016-04-30 13:03:50,966 DEBUG
>>>>>> > > > org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch
>>>>>> > > > of logs to split
>>>>>> > > > 2016-04-30 13:03:50,966 INFO
>>>>>> > > > org.apache.hadoop.hbase.master.SplitLogManager: started
>>>>>> > > > splitting 33 logs in
>>>>>> > > > [hdfs://nameservice1/hbase/WALs/rs-node7.example.com
>>>>>> > > > ,60020,1458724695128-splitting]
>>>>>> > > > for [rs-node7.example.com,60020,1458724695128]
>>>>>> > > >
>>>>>> > > > * 13:10:47 WAL splits done, master began to re-assign regions
>>>>>> > > >
>>>>>> > > > 2016-04-30 13:10:47,655 INFO
>>>>>> > > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
>>>>>> > > > Reassigning 133 region(s) that
>>>>>> > > > rs-node7.example.com,60020,1458724695128 was carrying (and 0
>>>>>> > > > regions(s) that were opening on this server)
>>>>>> > > > 2016-04-30 13:10:47,665 INFO
>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning
>>>>>> > > > 133 region(s) across 6 server(s), round-robin=true
>>>>>> > > > 2016-04-30 13:10:47,667 INFO
>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>>>>>> > > > region(s) to rs-node1.example.com,60020,1458720625688
>>>>>> > > > 2016-04-30 13:10:47,667 INFO
>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>>>>>> > > > region(s) to rs-node2.example.com,60020,1458721110988
>>>>>> > > > 2016-04-30 13:10:47,667 INFO
>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>>>>>> > > > region(s) to rs-node3.example.com,60020,1458721713906
>>>>>> > > > 2016-04-30 13:10:47,679 INFO
>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 23
>>>>>> > > > region(s) to rs-node4.example.com,60020,1458722335527
>>>>>> > > > 2016-04-30 13:10:47,691 INFO
>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>>>>>> > > > region(s) to rs-node5.example.com,60020,1458722992296
>>>>>> > > > 2016-04-30 13:10:47,702 INFO
>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>>>>>> > > > region(s) to rs-node6.example.com,60020,1458723856883
>>>>>> > > >
>>>>>> > > > * 13:11:47 the "opening regions" RPC sent by the master to the
>>>>>> > > > region servers timed out after 1 minute
>>>>>> > > >
>>>>>> > > > 2016-04-30 13:11:47,780 INFO
>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Unable to
>>>>>> > > > communicate with rs-node3.example.com,60020,1458721713906 in
>>>>>> > > > order to assign regions
>>>>>> > > > java.io.IOException: Call to
>>>>>> > > > rs-node3.example.com/172.16.6.3:60020 failed on local exception:
>>>>>> > > > org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=4,
>>>>>> > > > waitTime=60001, operationTimeout=60000 expired.
>>>>>> > > > 2016-04-30 13:11:47,782 INFO
>>>>>> > > > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed
>>>>>> > > > assigning 22 regions to server
>>>>>> > > > rs-node6.example.com,60020,1458723856883, reassigning them
>>>>>> > > > 2016-04-30 13:11:47,782 INFO
>>>>>> > > > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed
>>>>>> > > > assigning 23 regions to server
>>>>>> > > > rs-node4.example.com,60020,1458722335527, reassigning them
>>>>>> > > > 2016-04-30 13:11:47,782 INFO
>>>>>> > > > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed
>>>>>> > > > assigning 22 regions to server
>>>>>> > > > rs-node3.example.com,60020,1458721713906, reassigning them
>>>>>> > > > 2016-04-30 13:11:47,783 DEBUG
>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Force region
>>>>>> > > > state offline {a65660e421f114e93862194f7cc35644 state=OPENING,
>>>>>> > > > ts=1462021907753, server=rs-node6.example.com,60020,1458723856883}
>>>>>> > > >
>>>>>> > > > * After that, some of the regions (40 out of 130) never came
>>>>>> > > > online, and the following lines were logged repeatedly in the
>>>>>> > > > master log:
>>>>>> > > >
>>>>>> > > > 2016-04-30 13:12:37,188 DEBUG
>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: update
>>>>>> > > > {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
>>>>>> > > > ts=1462021957087,
>>>>>> > > > server=rs-node6.example.com,60020,1458723856883} the timestamp.
>>>>>> > > > 2016-04-30 13:12:37,188 DEBUG
>>>>>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Region is
>>>>>> > > > already in transition; waiting up to 10668ms
>>>>>> > > >
>>>>>> > > > $ grep 'AssignmentManager: update {d1c7f3f455f2529da82a2f713b5ee067' \
>>>>>> > > >     /var/log/hbase/hbase-cmf-hbase-MASTER-head.example.com.log.out.1 | wc -l
>>>>>> > > > 484
>>>>>> > > >
>>>>>> > > > I've searched the mailing list archives and the HBase JIRA, but
>>>>>> > > > didn't find any similar situation. The most similar one is
>>>>>> > > > HBASE-14407 <https://issues.apache.org/jira/browse/HBASE-14407>,
>>>>>> > > > but after reading its discussion I don't think it's the same
>>>>>> > > > problem.
>>>>>> > > >
>>>>>> > > > Anyone have a clue? Thanks!
>>>>>> > > >
>>>>>> > > > Regards,
>>>>>> > > > Shuai
>>>>>> > > >
>>>>>> > >
>>>>>> >
>>>>>>
>>>>>
>>>>
>>>
>>
>
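For readers hitting the same symptom: the thread points at two settings that bound the bulk-assign OPEN call, `hbase.rpc.timeout` (the 60000 ms `operationTimeout` seen expiring in the CallTimeoutException) and `hbase.regionserver.rpc.startup.waittime` (the 60000 ms default read in the quoted AssignmentManager#assign() code). A minimal hbase-site.xml sketch raising both follows; the 300000 ms values are illustrative assumptions, not tuned recommendations, and the thread does not confirm that raising them avoids the stuck PENDING_OPEN state.

```xml
<!-- hbase-site.xml sketch; values are illustrative, not recommendations -->
<configuration>
  <!-- Default 60000 ms: deadline for the master's OpenRegion RPC.
       The thread's CallTimeoutException shows operationTimeout=60000 expiring. -->
  <property>
    <name>hbase.rpc.timeout</name>
    <value>300000</value>
  </property>
  <!-- Default 60000 ms: how long AssignmentManager#assign() waits on the
       region server during assignment (per the code quoted above). -->
  <property>
    <name>hbase.regionserver.rpc.startup.waittime</name>
    <value>300000</value>
  </property>
</configuration>
```

Raising the RPC deadline only buys time for the region server to work through its opening queue; it does not address the master sending a CLOSE for a region still in PENDING_OPEN, which is the suspected bug in this thread.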