Hi Samir,

It had been about 30 minutes since the last round of inserts; I was performing a count of rows in the table when the regionserver crashed.
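For context, the count in question was just a plain row count over the table, e.g. from the shell (the table name and CACHE value here are placeholders, not the real ones):

count 'mytable', CACHE => 1000
# or the MapReduce equivalent:
hbase org.apache.hadoop.hbase.mapreduce.RowCounter mytable

Either form is a full scan, so all 8M rows get read back through the hosting regionservers.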
I had another crash of a regionserver (OOM) which resulted in a similar problem, but this time running "hbase hbck -fix" twice was able to resolve the problems and get the cluster running again.
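For the archives, the repair amounted to roughly the sequence below; the report-only runs before and after are the obvious sanity checks rather than an exact transcript of what I typed:

hbase hbck          # report-only pass, lists the inconsistencies and lingering reference files
hbase hbck -fix     # first repair pass
hbase hbck -fix     # second pass, after which the remaining inconsistencies were gone
hbase hbck          # final check, should come back with 0 inconsistencies detected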
-Ian Brooks

On Tuesday 17 Jun 2014 14:11:15 Samir Ahmic wrote:
> Here is the explanation for WALTrailer from the source code:
>
> * A trailer that is appended to the end of a properly closed HLog WAL file.
> * If missing, this is either a legacy or a corrupted WAL file.
>
> So you probably ended up with corrupted WAL files. Before dropping the table
> you could try to identify the corrupted HLog files and remove them manually.
>
> Did you have some heavy write operations before this issue?
>
> Regards
>
>
> On Tue, Jun 17, 2014 at 1:51 PM, Ian Brooks <[email protected]> wrote:
>
> > Hi Samir,
> >
> > I managed to get the cluster running, but I had to drop and recreate one
> > of the tables, which is fine for a dev cluster but not helpful in prod.
> >
> > There were no space issues at the time; there was ~10GB free. The space
> > problems only occurred afterwards, when the logs filled up all the
> > remaining disk space.
> >
> > hdfs fsck shows
> > .........................................................Status: HEALTHY
> >  Total size: 2863459120 B
> >  Total dirs: 7316
> >  Total files: 16357
> >  Total symlinks: 0 (Files currently being written: 9)
> >  Total blocks (validated): 13492 (avg. block size 212233 B) (Total open file blocks (not validated): 9)
> >  Minimally replicated blocks: 13492 (100.0 %)
> >  Over-replicated blocks: 0 (0.0 %)
> >  Under-replicated blocks: 0 (0.0 %)
> >  Mis-replicated blocks: 0 (0.0 %)
> >  Default replication factor: 3
> >  Average block replication: 3.0
> >  Corrupt blocks: 0
> >  Missing replicas: 0 (0.0 %)
> >  Number of data-nodes: 8
> >  Number of racks: 1
> > FSCK ended at Tue Jun 17 12:50:57 BST 2014 in 1447 milliseconds
> >
> >
> > The filesystem under path '/' is HEALTHY
> >
> >
> > hbase hbck shows lingering reference file errors
> >
> > 2014-06-17 12:46:00,985 DEBUG [main] regionserver.StoreFileInfo: reference 'hdfs://swcluster1/user/hbase/data/default/profiles/27b74d576c781f75455a78d000518033/recovered.edits/0000000000002583119.temp' to region=temp hfile=0000000000002583119
> > ERROR: Found lingering reference file hdfs://swcluster1/user/hbase/data/default/profiles/27b74d576c781f75455a78d000518033/recovered.edits/0000000000002583119.temp
> > 2014-06-17 12:46:00,991 DEBUG [main] regionserver.StoreFileInfo: reference 'hdfs://swcluster1/user/hbase/data/default/rawScans/19b8606148e0cc115d787357392ff3b5/recovered.edits/0000000000002867765.temp' to region=temp hfile=0000000000002867765
> >
> > datanode logs on all servers were clean at the time of the crash and after.
> >
> > hadoop version 2.4
> > hbase version 0.98.3
> >
> > -Ian Brooks
> >
> > On Tuesday 17 Jun 2014 13:43:45 Samir Ahmic wrote:
> > > Hi Ian,
> > >
> > > What does hadoop fsck / say? Maybe you have some corrupted data on your
> > > cluster. Also try using hbase hbck to investigate the issue. If you have
> > > disk space issues try adding more data nodes to your cluster. Regarding
> > > the errors you have sent, they are thrown because ProtobufLogWriter is
> > > unable to write the WALTrailer.
> > > Also try to check DataNode logs on the servers where the regions are
> > > hosted. What versions of hadoop/hbase are you using?
> > >
> > > Good luck
> > > Samir
> > >
> > >
> > > On Tue, Jun 17, 2014 at 11:07 AM, Ian Brooks <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I had an issue last night where one of the regionservers crashed due to
> > > > running out of memory while doing a count on a table with 8M rows. This
> > > > left the system in a state where it thought the regions on that server
> > > > were still online, but they couldn't be accessed. Trying to either move
> > > > or offline the regions resulted in a timeout failure.
> > > >
> > > > I ended up taking the whole hbase cluster down, but when I tried to
> > > > bring it back up again, it failed to bring those regions back online,
> > > > mainly spewing out vast amounts of logs relating to IO errors on the
> > > > WAL splitting files and .tmp files.
> > > >
> > > > Unfortunately I don't have most of the logs anymore, as the errors
> > > > resulted in the nodes running out of disk space, but most of the errors
> > > > were like the one below; there were no errors in the hdfs logs for this
> > > > time.
> > > >
> > > > 2014-06-16 17:47:02,902 ERROR [regionserver16020.logRoller] wal.ProtobufLogWriter: Got IOException while writing trailer
> > > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hbase/WALs/sw-hadoop-008,16020,1402933620893/sw-hadoop-008%2C16020%2C1402933620893.1402933622645 could only be replicated to 0 nodes instead of minReplication (=1). There are 8 datanode(s) running and no node(s) are excluded in this operation.
> > > >     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1430)
> > > >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2684)
> > > >     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584)
> > > >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
> > > >     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > > >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> > > >     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> > > >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> > > >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> > > >     at java.security.AccessController.doPrivileged(Native Method)
> > > >     at javax.security.auth.Subject.doAs(Subject.java:415)
> > > >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> > > >     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> > > >
> > > >     at org.apache.hadoop.ipc.Client.call(Client.java:1347)
> > > >     at org.apache.hadoop.ipc.Client.call(Client.java:1300)
> > > >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> > > >     at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
> > > >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:330)
> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > >     at java.lang.reflect.Method.invoke(Method.java:606)
> > > >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
> > > >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > > >     at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > >     at java.lang.reflect.Method.invoke(Method.java:606)
> > > >     at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294)
> > > >     at com.sun.proxy.$Proxy15.addBlock(Unknown Source)
> > > >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1226)
> > > >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1078)
> > > >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> > > > 2014-06-16 17:47:02,902 ERROR [regionserver16020.logRoller] wal.FSHLog: Failed close of HLog writer
> > > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hbase/WALs/sw-hadoop-008,16020,1402933620893/sw-hadoop-008%2C16020%2C1402933620893.1402933622645 could only be replicated to 0 nodes instead of minReplication (=1). There are 8 datanode(s) running and no node(s) are excluded in this operation.
> > > >     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1430)
> > > >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2684)
> > > >     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584)
> > > >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
> > > >     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > > >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> > > >     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> > > >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> > > >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> > > >     at java.security.AccessController.doPrivileged(Native Method)
> > > >     at javax.security.auth.Subject.doAs(Subject.java:415)
> > > >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> > > >     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> > > >
> > > >     at org.apache.hadoop.ipc.Client.call(Client.java:1347)
> > > >     at org.apache.hadoop.ipc.Client.call(Client.java:1300)
> > > >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> > > >     at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
> > > >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:330)
> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > >     at java.lang.reflect.Method.invoke(Method.java:606)
> > > >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
> > > >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > > >     at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > >     at java.lang.reflect.Method.invoke(Method.java:606)
> > > >     at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294)
> > > >     at com.sun.proxy.$Proxy15.addBlock(Unknown Source)
> > > >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1226)
> > > >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1078)
> > > >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> > > > 2014-06-16 17:47:02,909 WARN [regionserver16020.logRoller] wal.FSHLog: Riding over HLog close failure! error count=1
> > > >
> > > >
> > > > If the regions are marked as online but the shell won't let you do
> > > > anything, what is the best/correct way to get them back online again?
> > > >
> > > > -Ian Brooks
> > > >
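P.S. Samir, on your suggestion above to identify the corrupted HLog files and remove them manually, I'm assuming you mean something along these lines? The pretty-printer class name is from memory, so treat this as a sketch rather than exact commands:

# find the WAL files left behind for the crashed regionserver
hdfs dfs -ls /user/hbase/WALs
hdfs dfs -ls /user/hbase/WALs/<regionserver>,16020,<startcode>

# dump a suspect file; one that was never closed properly has no trailer and may fail to parse all the way through
hbase org.apache.hadoop.hbase.regionserver.wal.HLogPrettyPrinter <path-to-wal-file>

# move (rather than delete) the bad file out of the WALs directory, e.g. to a scratch location
hdfs dfs -mv <path-to-wal-file> /user/hbase/sidelined-wals/

If that's roughly what you had in mind, I'll try it next time before resorting to dropping the table.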
