Hi Folks,

We are running HBase 1.1.2, and it looks like we are hitting
https://issues.apache.org/jira/browse/HBASE-16464 in our production
cluster: our oldWALs folder has grown to 9.5 TB. I am aware that this
is fixed in releases after 2016, but unfortunately we need to operate
this production cluster for a few more months. (We are already
migrating to a newer version of HBase.)
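
For reference, this is how I am measuring it (standard hdfs shell;
/apps/hbase/data is our hbase.rootdir):

    hdfs dfs -du -s -h /apps/hbase/data/oldWALs
    # currently shows ~9.5 T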

I have verified that we don't have any snapshots in this cluster, and
we have removed all the replication peers from it. We restarted the
HBase master a few days ago, but it didn't help. We have TBs of
oldWALs and tens of thousands of recovered-edits files (I am assuming
recovered-edits files are also cleaned up by the cleaner chore). The
problem seems to have started around mid-December, but we didn't make
any major changes to the cluster at that time.
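
For what it's worth, this is how I checked that nothing is left in the
replication znodes (via zkCli.sh; I am assuming the parent znode is
/hbase, though on our HDP cluster it may be /hbase-unsecure):

    ls /hbase/replication/peers
    ls /hbase/replication/rs

If I read the layout right, the peer list should be empty and there
should be no WAL queues left under the per-regionserver znodes.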

Is there a workaround for HBASE-16464? Are there any references to
those deleted snapshots left in HDFS or ZooKeeper? If so, how can I
clean them up?
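
Concretely, is looking in these places the right way to find such
leftovers (paths relative to our hbase.rootdir; the znode name below
is my assumption for 1.1.x):

    hdfs dfs -ls /apps/hbase/data/.hbase-snapshot
    hdfs dfs -ls /apps/hbase/data/.hbase-snapshot/.tmp

and in zkCli.sh:

    ls /hbase/online-snapshot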

I keep seeing the following in the HMaster log:
2018-02-07 09:10:08,514 ERROR [hdpmaster6.bigdataprod1.wh.truecarcorp.com,60000,1517601353645_ChoreService_3] snapshot.SnapshotHFileCleaner: Exception while checking if files were valid, keeping them just in case.
org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: Couldn't read snapshot info from: hdfs://PRODNN/apps/hbase/data/.hbase-snapshot/.tmp/LEAD_SALES-1517979610/.snapshotinfo
    at org.apache.hadoop.hbase.snapshot.SnapshotDescriptionUtils.readSnapshotInfo(SnapshotDescriptionUtils.java:313)
    at org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.getHFileNames(SnapshotReferenceUtil.java:328)
    at org.apache.hadoop.hbase.master.snapshot.SnapshotHFileCleaner$1.filesUnderSnapshot(SnapshotHFileCleaner.java:85)
    at org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache.getSnapshotsInProgress(SnapshotFileCache.java:303)
    at org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache.getUnreferencedFiles(SnapshotFileCache.java:194)
    at org.apache.hadoop.hbase.master.snapshot.SnapshotHFileCleaner.getDeletableFiles(SnapshotHFileCleaner.java:62)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:233)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:157)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteDirectory(CleanerChore.java:180)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:149)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteDirectory(CleanerChore.java:180)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:149)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteDirectory(CleanerChore.java:180)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:149)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteDirectory(CleanerChore.java:180)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:149)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteDirectory(CleanerChore.java:180)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:149)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:124)
    at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:185)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: File does not exist: /apps/hbase/data/.hbase-snapshot/.tmp/LEAD_SALES-1517979610/.snapshotinfo
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)

    at sun.reflect.GeneratedConstructorAccessor22.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
    at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1242)
    at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227)
    at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1215)
    at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:303)
    at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:269)
    at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:261)
    at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1540)
    at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:303)
    at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:299)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
    at org.apache.hadoop.hbase.snapshot.SnapshotDescriptionUtils.readSnapshotInfo(SnapshotDescriptionUtils.java:306)
    ... 26 more
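
Based on the path in the trace, my guess is that the cleaner chore is
tripping over a leftover temporary snapshot directory under
.hbase-snapshot/.tmp, so on every run it keeps the files "just in
case" and nothing ever gets deleted. Would the following be a safe
workaround, i.e. moving the stale .tmp entry aside (rather than
deleting it outright) while no snapshot is running, and letting the
cleaner catch up?

    hdfs dfs -ls /apps/hbase/data/.hbase-snapshot/.tmp
    # assuming LEAD_SALES-1517979610 is the only (stale) entry:
    hdfs dfs -mv /apps/hbase/data/.hbase-snapshot/.tmp/LEAD_SALES-1517979610 /tmp/LEAD_SALES-1517979610.stale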


-- 
Thanks & Regards,
Anil Gupta
