Here is the explanation of WALTrailer from the source code:

   * A trailer that is appended to the end of a properly closed HLog WAL file.

   * If missing, this is either a legacy or a corrupted WAL file.

So you probably ended up with corrupted WAL files. Before dropping the table
you could try to identify the corrupted HLog files and remove them manually.
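
If you want to identify the affected files before removing anything, you
could try something like the commands below. The WAL directory and cluster
name are taken from your logs; the exact reader command may differ between
releases (in 0.98 I believe it is HLogPrettyPrinter, newer releases have
"hbase wal"), so treat this as a sketch rather than a recipe:

  # list the WAL files left behind by the crashed regionserver
  hdfs dfs -ls /user/hbase/WALs/sw-hadoop-008,16020,1402933620893

  # try to dump one of them; a file with a missing or invalid trailer
  # usually shows up as a warning while reading
  hbase org.apache.hadoop.hbase.regionserver.wal.HLogPrettyPrinter \
      hdfs://swcluster1/user/hbase/WALs/sw-hadoop-008,16020,1402933620893/<wal-file>

  # move (rather than delete) anything suspicious so it can be restored later
  hdfs dfs -mkdir /user/hbase/corrupt-wal-backup
  hdfs dfs -mv hdfs://swcluster1/user/hbase/WALs/sw-hadoop-008,16020,1402933620893/<wal-file> /user/hbase/corrupt-wal-backup/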

Did you have some heavy write operations before this issue?
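
For the lingering reference file errors hbck reports under recovered.edits
(quoted further down in your mail), there is an hbck repair option that may
help once the bad WAL files are out of the way. I haven't verified this on
0.98.3, so check hbase hbck -help for the exact flag first, and run the
read-only check before any repair:

  # report only, makes no changes
  hbase hbck -details

  # attempt to sideline the lingering reference files
  hbase hbck -fixReferenceFiles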

Regards




On Tue, Jun 17, 2014 at 1:51 PM, Ian Brooks <[email protected]> wrote:

> Hi Samir,
>
> I managed to get the cluster running, but I had to drop and recreate one
> of the tables, which is fine for a dev cluster but not helpful in prod.
>
> There were no space issues at the time; there was ~10GB free. They only
> occurred afterwards, when the logs filled up all the remaining disk space.
>
> hdfs fsck shows
> .........................................................Status: HEALTHY
>  Total size:    2863459120 B
>  Total dirs:    7316
>  Total files:   16357
>  Total symlinks:                0 (Files currently being written: 9)
>  Total blocks (validated):      13492 (avg. block size 212233 B) (Total
> open file blocks (not validated): 9)
>  Minimally replicated blocks:   13492 (100.0 %)
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       0 (0.0 %)
>  Mis-replicated blocks:         0 (0.0 %)
>  Default replication factor:    3
>  Average block replication:     3.0
>  Corrupt blocks:                0
>  Missing replicas:              0 (0.0 %)
>  Number of data-nodes:          8
>  Number of racks:               1
> FSCK ended at Tue Jun 17 12:50:57 BST 2014 in 1447 milliseconds
>
>
> The filesystem under path '/' is HEALTHY
>
>
>
> hbase hbck shows lingering reference file errors
>
> 2014-06-17 12:46:00,985 DEBUG [main] regionserver.StoreFileInfo: reference
> 'hdfs://swcluster1/user/hbase/data/default/profiles/27b74d576c781f75455a78d000518033/recovered.edits/0000000000002583119.temp'
> to region=temp hfile=0000000000002583119
> ERROR: Found lingering reference file
> hdfs://swcluster1/user/hbase/data/default/profiles/27b74d576c781f75455a78d000518033/recovered.edits/0000000000002583119.temp
> 2014-06-17 12:46:00,991 DEBUG [main] regionserver.StoreFileInfo: reference
> 'hdfs://swcluster1/user/hbase/data/default/rawScans/19b8606148e0cc115d787357392ff3b5/recovered.edits/0000000000002867765.temp'
> to region=temp hfile=0000000000002867765
>
> The datanode logs on all servers were clean at the time of the crash and after.
>
> hadoop version 2.4
> hbase version 0.98.3
>
> -Ian Brooks
>
> On Tuesday 17 Jun 2014 13:43:45 Samir Ahmic wrote:
> > Hi Ian,
> >
> > What does hadoop fsck / say? Maybe you have some corrupted data on your
> > cluster. Also try using hbase hbck to investigate the issue. If you have
> > disk space issues, try adding more data nodes to your cluster. Regarding
> > the errors you have sent, they are thrown because ProtobufLogWriter is
> > unable to write the WALTrailer.
> > Also try to check the DataNode logs on the servers where the regions are
> > hosted. What versions of hadoop/hbase are you using?
> >
> > Good luck
> > Samir
> >
> >
> > On Tue, Jun 17, 2014 at 11:07 AM, Ian Brooks <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > I had an issue last night where one of the regionservers crashed due to
> > > running out of memory while doing a count on a table with 8M rows. This
> > > left the system in a state where it thought the regions on that server
> > > were still online, but they couldn't be accessed. Trying to either move
> > > or offline the regions resulted in a timeout failure.
> > >
> > > I ended up taking the whole hbase cluster down, but when I tried to
> > > bring it back up again, it failed to bring those regions back online,
> > > mainly spewing out vast amounts of logs relating to IO errors on the
> > > WAL splitting files and .tmp files.
> > >
> > > Unfortunately I don't have most of the logs anymore, as the errors
> > > resulted in the nodes running out of disk space, but most of the errors
> > > were like the one below - there were no errors in the HDFS logs for
> > > this time.
> > >
> > > 2014-06-16 17:47:02,902 ERROR [regionserver16020.logRoller]
> > > wal.ProtobufLogWriter: Got IOException while writing trailer
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> > >
> /user/hbase/WALs/sw-hadoop-008,16020,1402933620893/sw-hadoop-008%2C16020%2C1402933620893.1402933622645
> > > could only be replicated to 0 nodes instead of minReplication (=1).
>  There
> > > are 8 datanode(s) running and no node(s) are excluded in this
> operation.
> > >         at
> > >
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1430)
> > >         at
> > >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2684)
> > >         at
> > >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584)
> > >         at
> > >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
> > >         at
> > >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > >         at
> > >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> > >         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> > >         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> > >         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> > >         at java.security.AccessController.doPrivileged(Native Method)
> > >         at javax.security.auth.Subject.doAs(Subject.java:415)
> > >         at
> > >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> > >         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> > >
> > >         at org.apache.hadoop.ipc.Client.call(Client.java:1347)
> > >         at org.apache.hadoop.ipc.Client.call(Client.java:1300)
> > >         at
> > >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> > >         at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
> > >         at
> > >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:330)
> > >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >         at
> > >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > >         at
> > >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > >         at java.lang.reflect.Method.invoke(Method.java:606)
> > >         at
> > >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
> > >         at
> > >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > >         at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
> > >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >         at
> > >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > >         at
> > >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > >         at java.lang.reflect.Method.invoke(Method.java:606)
> > >         at
> > > org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294)
> > >         at com.sun.proxy.$Proxy15.addBlock(Unknown Source)
> > >         at
> > >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1226)
> > >         at
> > >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1078)
> > >         at
> > >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> > > 2014-06-16 17:47:02,902 ERROR [regionserver16020.logRoller] wal.FSHLog:
> > > Failed close of HLog writer
> > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> > >
> /user/hbase/WALs/sw-hadoop-008,16020,1402933620893/sw-hadoop-008%2C16020%2C1402933620893.1402933622645
> > > could only be replicated to 0 nodes instead of minReplication (=1).
>  There
> > > are 8 datanode(s) running and no node(s) are excluded in this
> operation.
> > >         at
> > >
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1430)
> > >         at
> > >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2684)
> > >         at
> > >
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584)
> > >         at
> > >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
> > >         at
> > >
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > >         at
> > >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> > >         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> > >         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> > >         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> > >         at java.security.AccessController.doPrivileged(Native Method)
> > >         at javax.security.auth.Subject.doAs(Subject.java:415)
> > >         at
> > >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> > >         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> > >
> > >         at org.apache.hadoop.ipc.Client.call(Client.java:1347)
> > >         at org.apache.hadoop.ipc.Client.call(Client.java:1300)
> > >         at
> > >
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> > >         at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
> > >         at
> > >
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:330)
> > >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >         at
> > >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > >         at
> > >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > >         at java.lang.reflect.Method.invoke(Method.java:606)
> > >         at
> > >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
> > >         at
> > >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > >         at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
> > >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >         at
> > >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > >         at
> > >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > >         at java.lang.reflect.Method.invoke(Method.java:606)
> > >         at
> > > org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294)
> > >         at com.sun.proxy.$Proxy15.addBlock(Unknown Source)
> > >         at
> > >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1226)
> > >         at
> > >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1078)
> > >         at
> > >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> > > 2014-06-16 17:47:02,909 WARN  [regionserver16020.logRoller] wal.FSHLog:
> > > Riding over HLog close failure! error count=1
> > >
> > >
> > > If the regions are marked as online but the shell won't let you do
> > > anything, what is the best/correct way to get them back online again?
> > >
> > > -Ian Brooks
> > >
> > >
> > >
> > >
>
