Hi Samir, I managed to get the cluster running, but I had to drop and recreate one of the tables, which is fine for a dev cluster but not helpful in prod.
There were no space issues at the time (there was ~10GB free); they only occurred afterwards, when the logs filled up all the remaining disk space.

hdfs fsck shows:

.........................................................Status: HEALTHY
 Total size:    2863459120 B
 Total dirs:    7316
 Total files:   16357
 Total symlinks:        0 (Files currently being written: 9)
 Total blocks (validated):      13492 (avg. block size 212233 B) (Total open file blocks (not validated): 9)
 Minimally replicated blocks:   13492 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          8
 Number of racks:               1
FSCK ended at Tue Jun 17 12:50:57 BST 2014 in 1447 milliseconds

The filesystem under path '/' is HEALTHY

hbase hbck shows lingering reference file errors:

2014-06-17 12:46:00,985 DEBUG [main] regionserver.StoreFileInfo: reference 'hdfs://swcluster1/user/hbase/data/default/profiles/27b74d576c781f75455a78d000518033/recovered.edits/0000000000002583119.temp' to region=temp hfile=0000000000002583119
ERROR: Found lingering reference file hdfs://swcluster1/user/hbase/data/default/profiles/27b74d576c781f75455a78d000518033/recovered.edits/0000000000002583119.temp
2014-06-17 12:46:00,991 DEBUG [main] regionserver.StoreFileInfo: reference 'hdfs://swcluster1/user/hbase/data/default/rawScans/19b8606148e0cc115d787357392ff3b5/recovered.edits/0000000000002867765.temp' to region=temp hfile=0000000000002867765

The datanode logs on all servers were clean at the time of the crash and afterwards.

hadoop version 2.4
hbase version 0.98.3

-Ian Brooks

On Tuesday 17 Jun 2014 13:43:45 Samir Ahmic wrote:
> Hi Ian,
>
> What does hadoop fsck / say? Maybe you have some corrupted data on your
> cluster. Also try using hbase hbck to investigate the issue. If you have disk
> space issues, try adding more data nodes to your cluster.
> Regarding the errors you sent: they are thrown because ProtobufLogWriter is
> unable to write the WALTrailer.
> Also try to check the DataNode logs on the servers where the regions are
> hosted. What versions of hadoop/hbase are you using?
>
> Good luck
> Samir
>
>
> On Tue, Jun 17, 2014 at 11:07 AM, Ian Brooks <[email protected]>
> wrote:
>
> > Hi,
> >
> > I had an issue last night where one of the regionservers crashed due to
> > running out of memory while doing a count on a table with 8M rows. This
> > left the system in a state where it thought the regions on that server were
> > still online, but they couldn't be accessed. Trying to either move or
> > offline the regions resulted in a timeout failure.
> >
> > I ended up taking the whole hbase cluster down, but when I tried to bring
> > it back up again, it failed to bring those regions back online, mainly
> > spewing out vast amounts of logs relating to IO errors on the WAL splitting
> > files and .tmp files.
> >
> > Unfortunately I don't have most of the logs anymore, as the errors
> > resulting from the failure filled up the disk space on the nodes, but most
> > of the errors were like the one below - there were no errors in the hdfs
> > logs for this time.
> >
> > 2014-06-16 17:47:02,902 ERROR [regionserver16020.logRoller]
> > wal.ProtobufLogWriter: Got IOException while writing trailer
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> > /user/hbase/WALs/sw-hadoop-008,16020,1402933620893/sw-hadoop-008%2C16020%2C1402933620893.1402933622645
> > could only be replicated to 0 nodes instead of minReplication (=1). There
> > are 8 datanode(s) running and no node(s) are excluded in this operation.
> >     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1430)
> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2684)
> >     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584)
> >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
> >     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> >     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at javax.security.auth.Subject.doAs(Subject.java:415)
> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> >     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> >
> >     at org.apache.hadoop.ipc.Client.call(Client.java:1347)
> >     at org.apache.hadoop.ipc.Client.call(Client.java:1300)
> >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> >     at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
> >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:330)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >     at java.lang.reflect.Method.invoke(Method.java:606)
> >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
> >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> >     at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >     at java.lang.reflect.Method.invoke(Method.java:606)
> >     at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294)
> >     at com.sun.proxy.$Proxy15.addBlock(Unknown Source)
> >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1226)
> >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1078)
> >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> >
> > 2014-06-16 17:47:02,902 ERROR [regionserver16020.logRoller] wal.FSHLog:
> > Failed close of HLog writer
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> > /user/hbase/WALs/sw-hadoop-008,16020,1402933620893/sw-hadoop-008%2C16020%2C1402933620893.1402933622645
> > could only be replicated to 0 nodes instead of minReplication (=1). There
> > are 8 datanode(s) running and no node(s) are excluded in this operation.
> >     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1430)
> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2684)
> >     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584)
> >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
> >     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> >     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at javax.security.auth.Subject.doAs(Subject.java:415)
> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> >     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> >
> >     at org.apache.hadoop.ipc.Client.call(Client.java:1347)
> >     at org.apache.hadoop.ipc.Client.call(Client.java:1300)
> >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> >     at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
> >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:330)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >     at java.lang.reflect.Method.invoke(Method.java:606)
> >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
> >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> >     at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >     at java.lang.reflect.Method.invoke(Method.java:606)
> >     at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294)
> >     at com.sun.proxy.$Proxy15.addBlock(Unknown Source)
> >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1226)
> >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1078)
> >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> >
> > 2014-06-16 17:47:02,909 WARN [regionserver16020.logRoller] wal.FSHLog:
> > Riding over HLog close failure! error count=1
> >
> >
> > If the regions are marked as online but the shell won't let you do
> > anything, what is the best/correct way to get them back online again?
> >
> > -Ian Brooks
> >
> >
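As an aside, the WAL file name in the errors above identifies which regionserver the log belonged to: the directory/file name is the server name encoded as host,port,startcode, with the commas URL-encoded as %2C in the file name. A quick way to recover it with plain bash (no cluster access needed) - the path below is copied from the error in this thread:

```shell
# Decode the regionserver identity from a WAL path (pure bash string ops).
wal="/user/hbase/WALs/sw-hadoop-008,16020,1402933620893/sw-hadoop-008%2C16020%2C1402933620893.1402933622645"

name=${wal##*/}         # strip the directories, keeping the file name
name=${name%.*}         # drop the trailing log-roll timestamp
server=${name//%2C/,}   # undo the URL-encoded commas

echo "$server"          # sw-hadoop-008,16020,1402933620893 (host,port,startcode)
```

That makes it easy to match a WAL from the logs back to the regionserver (here sw-hadoop-008) that crashed.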
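On the lingering reference files that hbck reported: hbck in this HBase line has repair options, including -fixReferenceFiles, which sidelines lingering reference store files. A cautious sketch (the sideline directory name here is just an example, and the lingering path is copied from the hbck output in this thread) would be to copy the stale recovered.edits .temp file aside first, then let hbck do the repair:

```shell
# Sketch only - the first part is pure bash and runnable anywhere; the
# hdfs/hbase commands are shown commented out since they need the cluster.
lingering="hdfs://swcluster1/user/hbase/data/default/profiles/27b74d576c781f75455a78d000518033/recovered.edits/0000000000002583119.temp"

# Pull out the encoded region name (the 32-char hash) for reference.
region=${lingering%/recovered.edits/*}
region=${region##*/}
echo "$region"    # 27b74d576c781f75455a78d000518033

# Keep a copy of the stale edit file before any repair (example dest dir):
# hdfs dfs -cp "$lingering" /user/hbase/sideline-backup/
# Then let hbck sideline the lingering references itself:
# hbase hbck -fixReferenceFiles
```

Re-running plain `hbase hbck` afterwards should show whether the lingering-reference errors are gone before attempting anything with region assignments.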
