Hi Samir, I managed to get the cluster running, but I had to drop and recreate one of the tables, which is fine for a dev cluster but not helpful in prod.
There were no space issues at the time (there was ~10GB free); they only occurred afterwards, when the logs filled up all the remaining disk space.

hdfs fsck shows:

.........................................................Status: HEALTHY
 Total size:    2863459120 B
 Total dirs:    7316
 Total files:   16357
 Total symlinks:        0 (Files currently being written: 9)
 Total blocks (validated):      13492 (avg. block size 212233 B) (Total open file blocks (not validated): 9)
 Minimally replicated blocks:   13492 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          8
 Number of racks:               1
FSCK ended at Tue Jun 17 12:50:57 BST 2014 in 1447 milliseconds

The filesystem under path '/' is HEALTHY

hbase hbck shows lingering reference file errors:

2014-06-17 12:46:00,985 DEBUG [main] regionserver.StoreFileInfo: reference 'hdfs://swcluster1/user/hbase/data/default/profiles/27b74d576c781f75455a78d000518033/recovered.edits/0000000000002583119.temp' to region=temp hfile=0000000000002583119
ERROR: Found lingering reference file hdfs://swcluster1/user/hbase/data/default/profiles/27b74d576c781f75455a78d000518033/recovered.edits/0000000000002583119.temp
2014-06-17 12:46:00,991 DEBUG [main] regionserver.StoreFileInfo: reference 'hdfs://swcluster1/user/hbase/data/default/rawScans/19b8606148e0cc115d787357392ff3b5/recovered.edits/0000000000002867765.temp' to region=temp hfile=0000000000002867765

The datanode logs on all servers were clean at the time of the crash and afterwards.

hadoop version 2.4
hbase version 0.98.3

-Ian Brooks

On Tuesday 17 Jun 2014 13:43:45 Samir Ahmic wrote:
> Hi Ian,
>
> What does hadoop fsck / say? Maybe you have some corrupted data on your
> cluster. Also try using hbase hbck to investigate the issue. If you have disk
> space issues, try adding more data nodes to your cluster.
> Regarding the errors you sent: they are thrown because ProtobufLogWriter is
> unable to write the WALTrailer.
> Also try to check the DataNode logs on the servers where the regions are
> hosted. What versions of hadoop/hbase are you using?
>
> Good luck
> Samir
>
>
> On Tue, Jun 17, 2014 at 11:07 AM, Ian Brooks <[email protected]>
> wrote:
>
> > Hi,
> >
> > I had an issue last night where one of the regionservers crashed due to
> > running out of memory while doing a count on a table with 8M rows. This
> > left the system in a state where it thought the regions on that server were
> > still online, but they couldn't be accessed. Trying to either move or
> > offline the regions resulted in a timeout failure.
> >
> > I ended up taking the whole hbase cluster down, but when I tried to bring
> > it back up again, it failed to bring those regions back online, mainly
> > spewing out vast amounts of logs relating to IO errors on the WAL splitting
> > files and .tmp files.
> >
> > Unfortunately I don't have most of the logs anymore, as the errors
> > resulting from the failure filled up the disk space on the nodes, but most
> > of the errors were like the one below - there were no errors in the hdfs
> > logs for this time.
> >
> > 2014-06-16 17:47:02,902 ERROR [regionserver16020.logRoller]
> > wal.ProtobufLogWriter: Got IOException while writing trailer
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> > /user/hbase/WALs/sw-hadoop-008,16020,1402933620893/sw-hadoop-008%2C16020%2C1402933620893.1402933622645
> > could only be replicated to 0 nodes instead of minReplication (=1). There
> > are 8 datanode(s) running and no node(s) are excluded in this operation.
> >     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1430)
> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2684)
> >     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584)
> >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
> >     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> >     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at javax.security.auth.Subject.doAs(Subject.java:415)
> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> >     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> >
> >     at org.apache.hadoop.ipc.Client.call(Client.java:1347)
> >     at org.apache.hadoop.ipc.Client.call(Client.java:1300)
> >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> >     at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
> >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:330)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >     at java.lang.reflect.Method.invoke(Method.java:606)
> >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
> >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> >     at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >     at java.lang.reflect.Method.invoke(Method.java:606)
> >     at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294)
> >     at com.sun.proxy.$Proxy15.addBlock(Unknown Source)
> >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1226)
> >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1078)
> >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> >
> > 2014-06-16 17:47:02,902 ERROR [regionserver16020.logRoller] wal.FSHLog:
> > Failed close of HLog writer
> > org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> > /user/hbase/WALs/sw-hadoop-008,16020,1402933620893/sw-hadoop-008%2C16020%2C1402933620893.1402933622645
> > could only be replicated to 0 nodes instead of minReplication (=1). There
> > are 8 datanode(s) running and no node(s) are excluded in this operation.
> >     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1430)
> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2684)
> >     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584)
> >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
> >     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
> >     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
> >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
> >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at javax.security.auth.Subject.doAs(Subject.java:415)
> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> >     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> >
> >     at org.apache.hadoop.ipc.Client.call(Client.java:1347)
> >     at org.apache.hadoop.ipc.Client.call(Client.java:1300)
> >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> >     at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
> >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:330)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >     at java.lang.reflect.Method.invoke(Method.java:606)
> >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
> >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> >     at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >     at java.lang.reflect.Method.invoke(Method.java:606)
> >     at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294)
> >     at com.sun.proxy.$Proxy15.addBlock(Unknown Source)
> >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1226)
> >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1078)
> >     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:514)
> >
> > 2014-06-16 17:47:02,909 WARN [regionserver16020.logRoller] wal.FSHLog:
> > Riding over HLog close failure! error count=1
> >
> >
> > If the regions are marked as online but the shell won't let you do
> > anything, what is the best/correct way to get them back online again?
> >
> > -Ian Brooks
> >
> >
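As an aside, the WAL file name in the errors above identifies which regionserver the log belonged to: the directory/file name is the server name encoded as host,port,startcode, with the commas URL-encoded as %2C in the file name. A quick way to recover it with plain bash (no cluster access needed) - the path below is copied from the error in this thread:

```shell
# Decode the regionserver identity from a WAL path (pure bash string ops).
wal="/user/hbase/WALs/sw-hadoop-008,16020,1402933620893/sw-hadoop-008%2C16020%2C1402933620893.1402933622645"

name=${wal##*/}         # strip the directories, keeping the file name
name=${name%.*}         # drop the trailing log-roll timestamp
server=${name//%2C/,}   # undo the URL-encoded commas

echo "$server"          # sw-hadoop-008,16020,1402933620893 (host,port,startcode)
```

That makes it easy to match a WAL from the logs back to the regionserver (here sw-hadoop-008) that crashed.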
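On the lingering reference files that hbck reported: hbck in this HBase line has repair options, including -fixReferenceFiles, which sidelines lingering reference store files. A cautious sketch (the sideline directory name here is just an example, and the lingering path is copied from the hbck output in this thread) would be to copy the stale recovered.edits .temp file aside first, then let hbck do the repair:

```shell
# Sketch only - the first part is pure bash and runnable anywhere; the
# hdfs/hbase commands are shown commented out since they need the cluster.
lingering="hdfs://swcluster1/user/hbase/data/default/profiles/27b74d576c781f75455a78d000518033/recovered.edits/0000000000002583119.temp"

# Pull out the encoded region name (the 32-char hash) for reference.
region=${lingering%/recovered.edits/*}
region=${region##*/}
echo "$region"    # 27b74d576c781f75455a78d000518033

# Keep a copy of the stale edit file before any repair (example dest dir):
# hdfs dfs -cp "$lingering" /user/hbase/sideline-backup/
# Then let hbck sideline the lingering references itself:
# hbase hbck -fixReferenceFiles
```

Re-running plain `hbase hbck` afterwards should show whether the lingering-reference errors are gone before attempting anything with region assignments.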
