In AssignmentManager#assign(), you should find:

  // Send OPEN RPC. If it fails on a IOE or RemoteException,
  // regions will be assigned individually.
  long maxWaitTime = System.currentTimeMillis() +
    this.server.getConfiguration().
      getLong("hbase.regionserver.rpc.startup.waittime", 60000);
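That 60000 is read from the master's configuration, so if your servers genuinely need more than a minute to open a batch of regions after a crash, raising hbase.regionserver.rpc.startup.waittime on the master may be worth a try. A minimal sketch of the hbase-site.xml change, assuming a 3-minute window is enough for your cluster (the 180000 is only an illustrative value):

  <property>
    <!-- illustrative value: 3 minutes instead of the 60s default -->
    <name>hbase.regionserver.rpc.startup.waittime</name>
    <value>180000</value>
  </property>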
BTW, can you see what caused rs-node6 to not respond around 13:11:47?

Cheers

On Fri, May 20, 2016 at 6:20 AM, Shuai Lin <linshuai2...@gmail.com> wrote:

> Because the "opening regions" rpc call sent by the master to region
> server node6 timed out after 1 minute?
>
> *RPC call was sent:*
>
> 2016-04-30 13:10:47,702 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883
>
> *After 1 minute:*
>
> 2016-04-30 13:11:47,780 INFO org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate with rs-node6.example.com,60020,1458723856883 in order to assign regions,
> java.io.IOException: Call to rs-node6.example.com/172.16.6.6:60020 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=4, waitTime=60001, operationTimeout=60000 expired.
>
> 2016-04-30 13:11:47,783 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Force region state offline {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN, ts=1462021847743, server=rs-node6.example.com,60020,1458723856883}
>
> I have checked the hbase source code, but didn't find any specific
> timeout setting for the "open region" rpc call that I could use. So I
> guess it's using the default "hbase.rpc.timeout", which defaults to 60
> seconds. And since there are 20+ regions being assigned to node6 almost
> at the same moment, node6 gets overloaded and can't finish opening all
> of them within one minute.
>
> So this looks like an hbase bug to me (regions never get online when the
> region server fails to handle the OpenRegionRequest before the rpc
> timeout), am I right?
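> If it's really just the rpc timeout, I guess raising it on the master
> could work around the problem until the bug is fixed. A sketch of what I
> have in mind for the master's hbase-site.xml (untested, and 300000 is an
> arbitrary value):
>
>   <property>
>     <!-- untested: 5 minutes instead of the default 60000 -->
>     <name>hbase.rpc.timeout</name>
>     <value>300000</value>
>   </property>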
> On Fri, May 20, 2016 at 12:42 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Looks like region d1c7f3f455f2529da82a2f713b5ee067 received a CLOSE
>> request when it was opening, leading to RegionAlreadyInTransitionException.
>>
>> Was there any clue in the master log why the close request was sent?
>>
>> Cheers
>>
>> On Wed, May 4, 2016 at 8:02 PM, Shuai Lin <linshuai2...@gmail.com> wrote:
>>
>> > Hi Ted,
>> >
>> > The hbase version is 1.0.0-cdh5.4.8, shipped with cloudera CDH 5.4.8.
>> > The RS logs on node6 can be found here
>> > <http://paste.openstack.org/raw/496174/>.
>> >
>> > Thanks!
>> >
>> > Shuai
>> >
>> > On Thu, May 5, 2016 at 9:15 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>> >
>> > > Can you pastebin the related server log w.r.t.
>> > > d1c7f3f455f2529da82a2f713b5ee067 from rs-node6?
>> > >
>> > > Which release of hbase are you using?
>> > >
>> > > Cheers
>> > >
>> > > On Wed, May 4, 2016 at 6:07 PM, Shuai Lin <linshuai2...@gmail.com> wrote:
>> > >
>> > > > Hi list,
>> > > >
>> > > > Last weekend one of our region servers crashed, but some regions
>> > > > never got online again on other RSes. I've gone through the logs,
>> > > > and here is the timeline of some of the events:
>> > > >
>> > > > * 13:03:50 one of the region servers, rs-node7, died because of a
>> > > > disk failure. Master started to split rs-node7's WALs:
>> > > >
>> > > > 2016-04-30 13:03:50,953 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for rs-node7.example.com,60020,1458724695128 before assignment; region count=133
>> > > > 2016-04-30 13:03:50,966 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of logs to split
>> > > > 2016-04-30 13:03:50,966 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting 33 logs in [hdfs://nameservice1/hbase/WALs/rs-node7.example.com,60020,1458724695128-splitting] for [rs-node7.example.com,60020,1458724695128]
>> > > >
>> > > > * 13:10:47 WAL splits done, master began to re-assign regions:
>> > > >
>> > > > 2016-04-30 13:10:47,655 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 133 region(s) that rs-node7.example.com,60020,1458724695128 was carrying (and 0 regions(s) that were opening on this server)
>> > > > 2016-04-30 13:10:47,665 INFO org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 133 region(s) across 6 server(s), round-robin=true
>> > > > 2016-04-30 13:10:47,667 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to rs-node1.example.com,60020,1458720625688
>> > > > 2016-04-30 13:10:47,667 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to rs-node2.example.com,60020,1458721110988
>> > > > 2016-04-30 13:10:47,667 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to rs-node3.example.com,60020,1458721713906
>> > > > 2016-04-30 13:10:47,679 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 23 region(s) to rs-node4.example.com,60020,1458722335527
>> > > > 2016-04-30 13:10:47,691 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to rs-node5.example.com,60020,1458722992296
>> > > > 2016-04-30 13:10:47,702 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883
>> > > >
>> > > > * 13:11:47 the "opening regions" rpc calls sent by the master to the
>> > > > region servers timed out after 1 minute:
>> > > >
>> > > > 2016-04-30 13:11:47,780 INFO org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate with rs-node3.example.com,60020,1458721713906 in order to assign regions
>> > > > java.io.IOException: Call to rs-node3.example.com/172.16.6.3:60020 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=4, waitTime=60001, operationTimeout=60000 expired.
>> > > > 2016-04-30 13:11:47,782 INFO org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22 regions to server rs-node6.example.com,60020,1458723856883, reassigning them
>> > > > 2016-04-30 13:11:47,782 INFO org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 23 regions to server rs-node4.example.com,60020,1458722335527, reassigning them
>> > > > 2016-04-30 13:11:47,782 INFO org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22 regions to server rs-node3.example.com,60020,1458721713906, reassigning them
>> > > > 2016-04-30 13:11:47,783 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Force region state offline {a65660e421f114e93862194f7cc35644 state=OPENING, ts=1462021907753, server=rs-node6.example.com,60020,1458723856883}
>> > > >
>> > > > * After that, part of the regions (40 out of 130 regions) never got
>> > > > online, and the following lines were logged repeatedly in the
>> > > > master log:
>> > > >
>> > > > 2016-04-30 13:12:37,188 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: update {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN, ts=1462021957087, server=rs-node6.example.com,60020,1458723856883} the timestamp.
>> > > > 2016-04-30 13:12:37,188 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Region is already in transition; waiting up to 10668ms
>> > > >
>> > > > $ grep 'AssignmentManager: update {d1c7f3f455f2529da82a2f713b5ee067' /var/log/hbase/hbase-cmf-hbase-MASTER-head.example.com.log.out.1 | wc -l
>> > > > 484
>> > > >
>> > > > I've searched the mailing list archive and the hbase JIRA, but
>> > > > didn't find any similar situation like this one. The closest is
>> > > > HBASE-14407 <https://issues.apache.org/jira/browse/HBASE-14407>,
>> > > > but after reading its discussion I don't think that's the same
>> > > > problem.
>> > > >
>> > > > Anyone have a clue? Thanks!
>> > > >
>> > > > Regards,
>> > > > Shuai