Hi Ted,

Could you please take a look at the log I've pasted above? Thanks!

On Thu, May 5, 2016 at 11:02 AM, Shuai Lin <[email protected]> wrote:
> Hi Ted,
>
> The hbase version is 1.0.0-cdh5.4.8, shipped with Cloudera CDH 5.4.8. The
> RS logs on node6 can be found here:
> http://paste.openstack.org/raw/496174/
>
> Thanks!
>
> Shuai
>
> On Thu, May 5, 2016 at 9:15 AM, Ted Yu <[email protected]> wrote:
>
>> Can you pastebin the related server log w.r.t.
>> d1c7f3f455f2529da82a2f713b5ee067 from rs-node6?
>>
>> Which release of hbase are you using?
>>
>> Cheers
>>
>> On Wed, May 4, 2016 at 6:07 PM, Shuai Lin <[email protected]> wrote:
>>
>>> Hi list,
>>>
>>> Last weekend one of our region servers crashed, but some of its regions
>>> never got online again on the other RSes. I've gone through the logs;
>>> here is a timeline of the relevant events:
>>>
>>> * 13:03:50 one of the region servers, rs-node7, died because of a disk
>>> failure. The master started to split rs-node7's WALs:
>>>
>>> 2016-04-30 13:03:50,953 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for rs-node7.example.com,60020,1458724695128 before assignment; region count=133
>>> 2016-04-30 13:03:50,966 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of logs to split
>>> 2016-04-30 13:03:50,966 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting 33 logs in [hdfs://nameservice1/hbase/WALs/rs-node7.example.com,60020,1458724695128-splitting] for [rs-node7.example.com,60020,1458724695128]
>>>
>>> * 13:10:47 the WAL splits were done, and the master began to re-assign
>>> the regions:
>>>
>>> 2016-04-30 13:10:47,655 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 133 region(s) that rs-node7.example.com,60020,1458724695128 was carrying (and 0 regions(s) that were opening on this server)
>>> 2016-04-30 13:10:47,665 INFO org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 133 region(s) across 6 server(s), round-robin=true
>>> 2016-04-30 13:10:47,667 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to rs-node1.example.com,60020,1458720625688
>>> 2016-04-30 13:10:47,667 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to rs-node2.example.com,60020,1458721110988
>>> 2016-04-30 13:10:47,667 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to rs-node3.example.com,60020,1458721713906
>>> 2016-04-30 13:10:47,679 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 23 region(s) to rs-node4.example.com,60020,1458722335527
>>> 2016-04-30 13:10:47,691 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to rs-node5.example.com,60020,1458722992296
>>> 2016-04-30 13:10:47,702 INFO org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883
>>>
>>> * 13:11:47 the open-region RPCs sent by the master to the region servers
>>> timed out after 1 minute:
>>>
>>> 2016-04-30 13:11:47,780 INFO org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate with rs-node3.example.com,60020,1458721713906 in order to assign regions
>>> java.io.IOException: Call to rs-node3.example.com/172.16.6.3:60020 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=4, waitTime=60001, operationTimeout=60000 expired.
>>> 2016-04-30 13:11:47,782 INFO org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22 regions to server rs-node6.example.com,60020,1458723856883, reassigning them
>>> 2016-04-30 13:11:47,782 INFO org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 23 regions to server rs-node4.example.com,60020,1458722335527, reassigning them
>>> 2016-04-30 13:11:47,782 INFO org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22 regions to server rs-node3.example.com,60020,1458721713906, reassigning them
>>> 2016-04-30 13:11:47,783 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Force region state offline {a65660e421f114e93862194f7cc35644 state=OPENING, ts=1462021907753, server=rs-node6.example.com,60020,1458723856883}
>>>
>>> * After that, part of the regions (40 out of 130 regions) never got
>>> online, and the following lines were logged repeatedly in the master log:
>>>
>>> 2016-04-30 13:12:37,188 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: update {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN, ts=1462021957087, server=rs-node6.example.com,60020,1458723856883} the timestamp.
>>> 2016-04-30 13:12:37,188 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Region is already in transition; waiting up to 10668ms
>>>
>>> $ grep 'AssignmentManager: update {d1c7f3f455f2529da82a2f713b5ee067' /var/log/hbase/hbase-cmf-hbase-MASTER-head.example.com.log.out.1 | wc -l
>>> 484
>>>
>>> I've searched the mailing list archive and the HBase JIRA, but didn't
>>> find any similar situation. The closest one is HBASE-14407
>>> (https://issues.apache.org/jira/browse/HBASE-14407), but after reading
>>> its discussion I don't think it's the same problem.
>>>
>>> Does anyone have a clue? Thanks!
>>>
>>> Regards,
>>> Shuai
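P.S. In case it helps with triage: the grep in the quoted thread counts the
"update {...}" lines for one region id at a time. A rough Python sketch along
these lines (just a quick helper, not HBase tooling; the regex only assumes
the master log format quoted above) tallies the refresh count for every
region stuck in transition in one pass:

```python
import re
from collections import Counter

# Matches AssignmentManager "update {<region hash> state=<STATE>, ...}"
# lines, as seen in the master log snippets quoted above.
UPDATE_RE = re.compile(r"AssignmentManager: update \{([0-9a-f]+) state=(\w+),")

def count_stuck_regions(lines):
    """Return (per-region update counts, last seen state per region).

    A region that shows up hundreds of times (like the 484 grep hits for
    d1c7f3f455f2529da82a2f713b5ee067) is stuck in transition.
    """
    counts = Counter()
    states = {}
    for line in lines:
        m = UPDATE_RE.search(line)
        if m:
            region, state = m.groups()
            counts[region] += 1
            states[region] = state
    return counts, states

if __name__ == "__main__":
    # Path is the one from the grep command quoted above.
    log = "/var/log/hbase/hbase-cmf-hbase-MASTER-head.example.com.log.out.1"
    with open(log) as f:
        counts, states = count_stuck_regions(f)
    for region, n in counts.most_common():
        print(region, states[region], n)
```

Run over the master log, this should list the stuck regions (most-refreshed
first) together with the state they are wedged in, e.g. PENDING_OPEN.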
