Here is the complete log on node6 between 13:10:47 and 13:11:47: http://paste.openstack.org/raw/505826/

The master asked node6 to open several regions. Node6 opened the first 4 very fast (within 1 second) and got stuck at the 5th one. But there are no errors at that time.
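For anyone who wants to reproduce that check against the raw log, something like this should work. Note the RS log filename below is only my guess, modeled on the master log path quoted further down; adjust for your install:

  # RS log filename guessed from the master log's naming scheme
  $ LOG=/var/log/hbase/hbase-cmf-hbase-REGIONSERVER-rs-node6.example.com.log.out
  $ for r in d1c7f3f455f2529da82a2f713b5ee067 a65660e421f114e93862194f7cc35644; do
  >   grep -m1 "$r" "$LOG"   # first mention = when node6 started opening this region
  > done

The timestamp of each region's first mention shows when node6 started handling its open, so the gap between regions 4 and 5 stands out immediately.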
On Wed, May 25, 2016 at 10:12 PM, Ted Yu <[email protected]> wrote:

> In AssignmentManager#assign(), you should find:
>
>     // Send OPEN RPC. If it fails on a IOE or RemoteException,
>     // regions will be assigned individually.
>     long maxWaitTime = System.currentTimeMillis() +
>         this.server.getConfiguration().
>             getLong("hbase.regionserver.rpc.startup.waittime", 60000);
>
> BTW can you see what caused rs-node6 to not respond around 13:11:47?
>
> Cheers
>
> On Fri, May 20, 2016 at 6:20 AM, Shuai Lin <[email protected]> wrote:
>
>> Because the "opening regions" RPC call sent by the master to the region
>> server node6 timed out after 1 minute?
>>
>> *RPC call was sent:*
>>
>> 2016-04-30 13:10:47,702 INFO org.apache.hadoop.hbase.master.AssignmentManager:
>> Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883
>>
>> *After 1 minute:*
>>
>> 2016-04-30 13:11:47,780 INFO org.apache.hadoop.hbase.master.AssignmentManager:
>> Unable to communicate with rs-node6.example.com,60020,1458723856883 in order
>> to assign regions, java.io.IOException: Call to
>> rs-node6.example.com/172.16.6.6:60020 failed on local exception:
>> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=4, waitTime=60001,
>> operationTimeout=60000 expired.
>>
>> 2016-04-30 13:11:47,783 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
>> Force region state offline {d1c7f3f455f2529da82a2f713b5ee067
>> state=PENDING_OPEN, ts=1462021847743,
>> server=rs-node6.example.com,60020,1458723856883}
>>
>> I have checked the hbase source code, but didn't find any specific timeout
>> setting for the "open region" RPC call that I could use. So I guess it's
>> using the default "hbase.rpc.timeout", which defaults to 60 secs. And since
>> there are 20+ regions being assigned to node6 at almost the same moment,
>> node6 gets overloaded and can't finish opening all of them within one
>> minute.
>>
>> So this looks like an hbase bug to me (regions never get online when the
>> region server fails to handle the OpenRegionRequest before the RPC
>> timeout), am I right?
>>
>> On Fri, May 20, 2016 at 12:42 PM, Ted Yu <[email protected]> wrote:
>>
>>> Looks like region d1c7f3f455f2529da82a2f713b5ee067 received a CLOSE
>>> request while it was opening, leading to RegionAlreadyInTransitionException.
>>>
>>> Was there any clue in the master log about why the close request was
>>> sent?
>>>
>>> Cheers
>>>
>>> On Wed, May 4, 2016 at 8:02 PM, Shuai Lin <[email protected]> wrote:
>>>
>>>> Hi Ted,
>>>>
>>>> The hbase version is 1.0.0-cdh5.4.8, shipped with Cloudera CDH 5.4.8.
>>>> The RS logs on node6 can be found here:
>>>> http://paste.openstack.org/raw/496174/
>>>>
>>>> Thanks!
>>>>
>>>> Shuai
>>>>
>>>> On Thu, May 5, 2016 at 9:15 AM, Ted Yu <[email protected]> wrote:
>>>>
>>>>> Can you pastebin the related server log w.r.t.
>>>>> d1c7f3f455f2529da82a2f713b5ee067 from rs-node6?
>>>>>
>>>>> Which release of hbase are you using?
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Wed, May 4, 2016 at 6:07 PM, Shuai Lin <[email protected]> wrote:
>>>>>
>>>>>> Hi list,
>>>>>>
>>>>>> Last weekend a region server crashed, but some of its regions never
>>>>>> got online again on other RSes.
>>>>>> I've gone through the logs, and here is the timeline of some of the
>>>>>> events:
>>>>>>
>>>>>> * 13:03:50 one of the region servers, rs-node7, died because of a
>>>>>> disk failure. The master started to split rs-node7's WALs:
>>>>>>
>>>>>> 2016-04-30 13:03:50,953 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
>>>>>> Splitting logs for rs-node7.example.com,60020,1458724695128 before
>>>>>> assignment; region count=133
>>>>>> 2016-04-30 13:03:50,966 DEBUG org.apache.hadoop.hbase.master.SplitLogManager:
>>>>>> Scheduling batch of logs to split
>>>>>> 2016-04-30 13:03:50,966 INFO org.apache.hadoop.hbase.master.SplitLogManager:
>>>>>> started splitting 33 logs in
>>>>>> [hdfs://nameservice1/hbase/WALs/rs-node7.example.com,60020,1458724695128-splitting]
>>>>>> for [rs-node7.example.com,60020,1458724695128]
>>>>>>
>>>>>> * 13:10:47 WAL splits done, the master began to re-assign regions:
>>>>>>
>>>>>> 2016-04-30 13:10:47,655 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
>>>>>> Reassigning 133 region(s) that rs-node7.example.com,60020,1458724695128
>>>>>> was carrying (and 0 regions(s) that were opening on this server)
>>>>>> 2016-04-30 13:10:47,665 INFO org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Bulk assigning 133 region(s) across 6 server(s), round-robin=true
>>>>>> 2016-04-30 13:10:47,667 INFO org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Assigning 22 region(s) to rs-node1.example.com,60020,1458720625688
>>>>>> 2016-04-30 13:10:47,667 INFO org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Assigning 22 region(s) to rs-node2.example.com,60020,1458721110988
>>>>>> 2016-04-30 13:10:47,667 INFO org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Assigning 22 region(s) to rs-node3.example.com,60020,1458721713906
>>>>>> 2016-04-30 13:10:47,679 INFO org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Assigning 23 region(s) to rs-node4.example.com,60020,1458722335527
>>>>>> 2016-04-30 13:10:47,691 INFO org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Assigning 22 region(s) to rs-node5.example.com,60020,1458722992296
>>>>>> 2016-04-30 13:10:47,702 INFO org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883
>>>>>>
>>>>>> * 13:11:47 the "opening regions" RPC sent by the master to the region
>>>>>> servers timed out after 1 minute:
>>>>>>
>>>>>> 2016-04-30 13:11:47,780 INFO org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Unable to communicate with rs-node3.example.com,60020,1458721713906 in
>>>>>> order to assign regions
>>>>>> java.io.IOException: Call to rs-node3.example.com/172.16.6.3:60020
>>>>>> failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException:
>>>>>> Call id=4, waitTime=60001, operationTimeout=60000 expired.
>>>>>> 2016-04-30 13:11:47,782 INFO org.apache.hadoop.hbase.master.GeneralBulkAssigner:
>>>>>> Failed assigning 22 regions to server rs-node6.example.com,60020,1458723856883,
>>>>>> reassigning them
>>>>>> 2016-04-30 13:11:47,782 INFO org.apache.hadoop.hbase.master.GeneralBulkAssigner:
>>>>>> Failed assigning 23 regions to server rs-node4.example.com,60020,1458722335527,
>>>>>> reassigning them
>>>>>> 2016-04-30 13:11:47,782 INFO org.apache.hadoop.hbase.master.GeneralBulkAssigner:
>>>>>> Failed assigning 22 regions to server rs-node3.example.com,60020,1458721713906,
>>>>>> reassigning them
>>>>>> 2016-04-30 13:11:47,783 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Force region state offline {a65660e421f114e93862194f7cc35644
>>>>>> state=OPENING, ts=1462021907753,
>>>>>> server=rs-node6.example.com,60020,1458723856883}
>>>>>>
>>>>>> * After that, part of the regions (40 out of 130) never got online,
>>>>>> and the following lines were logged repeatedly in the master log:
>>>>>>
>>>>>> 2016-04-30 13:12:37,188 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> update {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
>>>>>> ts=1462021957087, server=rs-node6.example.com,60020,1458723856883}
>>>>>> the timestamp.
>>>>>> 2016-04-30 13:12:37,188 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
>>>>>> Region is already in transition; waiting up to 10668ms
>>>>>>
>>>>>> $ grep 'AssignmentManager: update {d1c7f3f455f2529da82a2f713b5ee067' \
>>>>>>     /var/log/hbase/hbase-cmf-hbase-MASTER-head.example.com.log.out.1 | wc -l
>>>>>> 484
>>>>>>
>>>>>> I've searched the mailing list archives and the hbase JIRA, but didn't
>>>>>> find any similar situation. The most similar one is HBASE-14407
>>>>>> <https://issues.apache.org/jira/browse/HBASE-14407>, but after reading
>>>>>> its discussion I don't think it's the same problem.
>>>>>>
>>>>>> Anyone have a clue? Thanks!
>>>>>>
>>>>>> Regards,
>>>>>> Shuai
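PS: as a stopgap while the root cause of node6's stall is unclear, one option I'm considering is raising both timeouts discussed above: "hbase.rpc.timeout" (the generic RPC timeout the open-region call apparently falls back to) and "hbase.regionserver.rpc.startup.waittime" (the master-side wait in the snippet Ted quoted). A sketch of the hbase-site.xml change; the 3-minute values are a guess for our cluster, not a tested recommendation:

  <property>
    <name>hbase.rpc.timeout</name>
    <!-- default 60000 (60s); note this affects all client RPCs, not just region opens -->
    <value>180000</value>
  </property>
  <property>
    <name>hbase.regionserver.rpc.startup.waittime</name>
    <!-- default 60000; keep it >= hbase.rpc.timeout so the master waits long enough -->
    <value>180000</value>
  </property>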
