Hi Samir,

It had been about 30 minutes since the last round of inserts; I was performing a count of rows in the table when the regionserver crashed.
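For context, the count in question was just a plain row count over the table, e.g. from the shell (the table name and CACHE value here are placeholders, not the real ones):

count 'mytable', CACHE => 1000
# or the MapReduce equivalent:
hbase org.apache.hadoop.hbase.mapreduce.RowCounter mytable

Either form is a full scan, so all 8M rows get read back through the hosting regionservers.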
I had another crash of a regionserver (OOM) which resulted in a similar problem, but this time running "hbase hbck -fix" twice was able to resolve the problems and get the cluster running again.
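For the archives, the repair amounted to roughly the sequence below; the report-only runs before and after are the obvious sanity checks rather than an exact transcript of what I typed:

hbase hbck          # report-only pass, lists the inconsistencies and lingering reference files
hbase hbck -fix     # first repair pass
hbase hbck -fix     # second pass, after which the remaining inconsistencies were gone
hbase hbck          # final check, should come back with 0 inconsistencies detected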
-Ian Brooks

On Tuesday 17 Jun 2014 14:11:15 Samir Ahmic wrote:
> Here is the explanation for WALTrailer from the source code:
>
> * A trailer that is appended to the end of a properly closed HLog WAL file.
> * If missing, this is either a legacy or a corrupted WAL file.
>
> So you probably ended up with corrupted WAL files. Before dropping the table
> you could try to identify the corrupted HLog files and remove them manually.
>
> Did you have some heavy write operations before this issue?
>
> Regards
>
>
> On Tue, Jun 17, 2014 at 1:51 PM, Ian Brooks <[email protected]> wrote:
>
> > Hi Samir,
> >
> > I managed to get the cluster running, but I had to drop and recreate one
> > of the tables, which is fine for a dev cluster but not helpful in prod.
> >
> > There were no space issues at the time; there was ~10GB free. The space
> > problems only occurred afterwards, when the logs filled up all the
> > remaining disk space.
> >
> > hdfs fsck shows
> > .........................................................Status: HEALTHY
> >  Total size: 2863459120 B
> >  Total dirs: 7316
> >  Total files: 16357
> >  Total symlinks: 0 (Files currently being written: 9)
> >  Total blocks (validated): 13492 (avg. block size 212233 B) (Total open file blocks (not validated): 9)
> >  Minimally replicated blocks: 13492 (100.0 %)
> >  Over-replicated blocks: 0 (0.0 %)
> >  Under-replicated blocks: 0 (0.0 %)
> >  Mis-replicated blocks: 0 (0.0 %)
> >  Default replication factor: 3
> >  Average block replication: 3.0
> >  Corrupt blocks: 0
> >  Missing replicas: 0 (0.0 %)
> >  Number of data-nodes: 8
> >  Number of racks: 1
> > FSCK ended at Tue Jun 17 12:50:57 BST 2014 in 1447 milliseconds
> >
> >
> > The filesystem under path '/' is HEALTHY
> >
> >
> > hbase hbck shows lingering reference file errors
> >
> > 2014-06-17 12:46:00,985 DEBUG [main] regionserver.StoreFileInfo: reference 'hdfs://swcluster1/user/hbase/data/default/profiles/27b74d576c781f75455a78d000518033/recovered.edits/0000000000002583119.temp' to region=temp hfile=0000000000002583119
> > ERROR: Found lingering reference file hdfs://swcluster1/user/hbase/data/default/profiles/27b74d576c781f75455a78d000518033/recovered.edits/0000000000002583119.temp
> > 2014-06-17 12:46:00,991 DEBUG [main] regionserver.StoreFileInfo: reference 'hdfs://swcluster1/user/hbase/data/default/rawScans/19b8606148e0cc115d787357392ff3b5/recovered.edits/0000000000002867765.temp' to region=temp hfile=0000000000002867765
> >
> > datanode logs on all servers were clean at the time of the crash and after.
> >
> > hadoop version 2.4
> > hbase version 0.98.3
> >
> > -Ian Brooks
> >
> > On Tuesday 17 Jun 2014 13:43:45 Samir Ahmic wrote:
> > > Hi Ian,
> > >
> > > What does hadoop fsck / say? Maybe you have some corrupted data on your
> > > cluster. Also try using hbase hbck to investigate the issue. If you have
> > > disk space issues try adding more data nodes to your cluster. Regarding
> > > the errors you have sent, they are thrown because ProtobufLogWriter is
> > > unable to write the WALTrailer.
> > > Also try to check DataNode logs on the servers where the regions are
> > > hosted. What versions of hadoop/hbase are you using?
> > >
> > > Good luck
> > > Samir
> > >
> > >
> > > On Tue, Jun 17, 2014 at 11:07 AM, Ian Brooks <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I had an issue last night where one of the regionservers crashed due to
> > > > running out of memory while doing a count on a table with 8M rows. This
> > > > left the system in a state where it thought the regions on that server
> > > > were still online, but they couldn't be accessed. Trying to either move
> > > > or offline the regions resulted in a timeout failure.
> > > >
> > > > I ended up taking the whole hbase cluster down, but when I tried to
> > > > bring it back up again, it failed to bring those regions back online,
> > > > mainly spewing out vast amounts of logs relating to IO errors on the
> > > > WAL splitting files and .tmp files.
> > > >
> > > > Unfortunately I don't have most of the logs anymore, as the errors
> > > > resulted in the nodes running out of disk space, but most of the errors
> > > > were like the one below; there were no errors in the hdfs logs for this
> > > > time.
> > > >
> > > > 2014-06-16 17:47:02,902 ERROR [regionserver16020.logRoller] wal.ProtobufLogWriter: Got IOException while writing trailer
> > > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hbase/WALs/sw-hadoop-008,16020,1402933620893/sw-hadoop-008%2C16020%2C1402933620893.1402933622645 could only be replicated to 0 nodes instead of minReplication (=1). There are 8 datanode(s) running and no node(s) are excluded in this operation.
> > > >     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1430)
> > > >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2684)
> > > >     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584)
> > > >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
> > > >     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > > >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> > > >     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> > > >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> > > >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> > > >     at java.security.AccessController.doPrivileged(Native Method)
> > > >     at javax.security.auth.Subject.doAs(Subject.java:415)
> > > >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> > > >     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> > > >
> > > >     at org.apache.hadoop.ipc.Client.call(Client.java:1347)
> > > >     at org.apache.hadoop.ipc.Client.call(Client.java:1300)
> > > >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> > > >     at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
> > > >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:330)
> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > >     at java.lang.reflect.Method.invoke(Method.java:606)
> > > >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
> > > >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > > >     at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > >     at java.lang.reflect.Method.invoke(Method.java:606)
> > > >     at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294)
> > > >     at com.sun.proxy.$Proxy15.addBlock(Unknown Source)
> > > >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1226)
> > > >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1078)
> > > >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> > > > 2014-06-16 17:47:02,902 ERROR [regionserver16020.logRoller] wal.FSHLog: Failed close of HLog writer
> > > > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hbase/WALs/sw-hadoop-008,16020,1402933620893/sw-hadoop-008%2C16020%2C1402933620893.1402933622645 could only be replicated to 0 nodes instead of minReplication (=1). There are 8 datanode(s) running and no node(s) are excluded in this operation.
> > > >     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1430)
> > > >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2684)
> > > >     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584)
> > > >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
> > > >     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> > > >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> > > >     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> > > >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> > > >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> > > >     at java.security.AccessController.doPrivileged(Native Method)
> > > >     at javax.security.auth.Subject.doAs(Subject.java:415)
> > > >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> > > >     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> > > >
> > > >     at org.apache.hadoop.ipc.Client.call(Client.java:1347)
> > > >     at org.apache.hadoop.ipc.Client.call(Client.java:1300)
> > > >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> > > >     at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
> > > >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:330)
> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > >     at java.lang.reflect.Method.invoke(Method.java:606)
> > > >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
> > > >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > > >     at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > >     at java.lang.reflect.Method.invoke(Method.java:606)
> > > >     at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294)
> > > >     at com.sun.proxy.$Proxy15.addBlock(Unknown Source)
> > > >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1226)
> > > >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1078)
> > > >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> > > > 2014-06-16 17:47:02,909 WARN [regionserver16020.logRoller] wal.FSHLog: Riding over HLog close failure! error count=1
> > > >
> > > >
> > > > If the regions are marked as online but the shell won't let you do
> > > > anything, what is the best/correct way to get them back online again?
> > > >
> > > > -Ian Brooks
> > > >
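P.S. Samir, on your suggestion above to identify the corrupted HLog files and remove them manually, I'm assuming you mean something along these lines? The pretty-printer class name is from memory, so treat this as a sketch rather than exact commands:

# find the WAL files left behind for the crashed regionserver
hdfs dfs -ls /user/hbase/WALs
hdfs dfs -ls /user/hbase/WALs/<regionserver>,16020,<startcode>

# dump a suspect file; one that was never closed properly has no trailer and may fail to parse all the way through
hbase org.apache.hadoop.hbase.regionserver.wal.HLogPrettyPrinter <path-to-wal-file>

# move (rather than delete) the bad file out of the WALs directory, e.g. to a scratch location
hdfs dfs -mv <path-to-wal-file> /user/hbase/sidelined-wals/

If that's roughly what you had in mind, I'll try it next time before resorting to dropping the table.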
