[jira] [Commented] (HBASE-19681) Online snapshot creation failing with missing store file

2018-03-26 Thread Saad Mufti (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16413820#comment-16413820
 ] 

Saad Mufti commented on HBASE-19681:


Restarting the region server worked for us also. 

> Online snapshot creation failing with missing store file
> 
>
> Key: HBASE-19681
> URL: https://issues.apache.org/jira/browse/HBASE-19681
> Project: HBase
>  Issue Type: Bug
>  Components: backuprestore, Performance, scaling, snapshots
>Affects Versions: 1.3.0
> Environment: Hadoop - 2.7.3
> HBase 1.3.0
> OS - GNU/Linux x86_64
> Cluster - Amazon Elastic Mapreduce
>Reporter: Anirban Roy
>Priority: Major
> Attachments: region-server-missing file-log.doc, 
> region-server-snapshot-exception-log.doc
>
>
> We are facing a problem creating online snapshots of our HBase table. The table 
> contains 20TB of data and is receiving ~1 writes per second. Snapshot 
> creation fails intermittently with an error that an HFile is missing; see the 
> detailed output below. Once we locate the region server hosting the region 
> and restart that region server, snapshot creation succeeds. It seems the 
> missing HFile was removed by a minor compaction, but the region server still 
> holds a reference to the file.
> [hadoop@ip-10-0-12-164 ~]$ hbase shell
> HBase Shell; enter 'help' for list of supported commands.
> Type "exit" to leave the HBase Shell
> Version 1.3.0, rUnknown, Fri Feb 17 18:15:07 UTC 2017
>  
> hbase(main):001:0> snapshot 'x_table', 'x_snapshot'
>  
> ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot { 
> ss=x_snapshot table=x_table type=FLUSH } had an error.  Procedure x_snapshot 
> { waiting=[] done=[ip-10-0-9-31.ec2.internal,16020,1508372578254, 
> ip-10-0-0-32.ec2.internal,16020,1508372591059, 
> ip-10-0-14-221.ec2.internal,16020,1508372580873, 
> ip-10-0-15-185.ec2.internal,16020,1508372588507, 
> ip-10-0-9-43.ec2.internal,16020,1508372569107, 
> ip-10-0-10-62.ec2.internal,16020,1512885921693, 
> ip-10-0-8-216.ec2.internal,16020,1508372584133, 
> ip-10-0-1-207.ec2.internal,16020,1508372580144, 
> ip-10-0-0-173.ec2.internal,16020,1508372584969, 
> ip-10-0-4-79.ec2.internal,16020,1508372587161, 
> ip-10-0-3-165.ec2.internal,16020,1508372593566, 
> ip-10-0-14-137.ec2.internal,16020,1508372583225, 
> ip-10-0-6-33.ec2.internal,16020,1508372581587, 
> ip-10-0-15-199.ec2.internal,16020,1508372587478, 
> ip-10-0-5-253.ec2.internal,16020,1508372581243, 
> ip-10-0-1-99.ec2.internal,16020,1508372609684] }
>         at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:354)
>         at org.apache.hadoop.hbase.master.MasterRpcServices.isSnapshotDone(MasterRpcServices.java:1058)
>         at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:61089)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2328)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:123)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
> Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via ip-10-0-3-13.ec2.internal,16020,1508372563772:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: java.io.FileNotFoundException: File does not exist: hdfs://ip-10-0-12-164.ec2.internal:8020/user/hbase/data/default/x_table/ecbb3aeaf7c5b1f65742deab5812362c/d/f76d8827c29244b99bf9344982956523
>         at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:83)
>         at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:315)
>         at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:344)
>         ... 6 more
> Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: java.io.FileNotFoundException: File does not exist: hdfs://ip-10-0-12-164.ec2.internal:8020/user/hbase/data/default/x_table/ecbb3aeaf7c5b1f65742deab5812362c/d/f76d8827c29244b99bf9344982956523
>         at org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager$SnapshotSubprocedurePool.waitForOutstandingTasks(RegionServerSnapshotManager.java:347)
>         at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.flushSnapshot(FlushSnapshotSubprocedure.java:140)
>         at org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure.insideBarrier(FlushSnapshotSubprocedure.java:160)
>         at 
> 
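The quoted description says the workaround starts with locating the region server that hosts the affected region. As a rough illustration only, the sketch below pulls that server name and the missing store file out of the snapshot error text shown above. The function and type names are made up for this example, and the archive location it derives assumes the usual `<rootdir>/archive/data/<namespace>/<table>/<region>/<family>/<hfile>` layout, which may differ on a given cluster.

```python
import re
from typing import NamedTuple, Optional


class MissingStoreFile(NamedTuple):
    server: Optional[str]  # "host,port,startcode" of the reporting region server, if present
    root: str              # HBase root directory, i.e. everything before "/data/"
    namespace: str
    table: str
    region: str            # encoded region name
    family: str            # column family directory
    hfile: str             # store file name


# "via <host>,<port>,<startcode>:" names the region server that raised the error.
_VIA = re.compile(r"via\s+([\w.\-]+,\d+,\d+):")
# Layout seen in the trace: <root>/data/<namespace>/<table>/<region>/<family>/<hfile>
_PATH = re.compile(
    r"File does not exist:\s+(?P<root>hdfs://\S+?)/data/"
    r"(?P<namespace>[^/\s]+)/(?P<table>[^/\s]+)/(?P<region>[0-9a-f]{32})/"
    r"(?P<family>[^/\s]+)/(?P<hfile>[0-9a-f]{32})"
)


def parse_missing_store_file(text: str) -> Optional[MissingStoreFile]:
    """Return the first missing store file named in an error message, or None."""
    # Undo mail quoting ("> " prefixes) and line wrapping before matching.
    flat = " ".join(line.lstrip("> ").strip() for line in text.splitlines())
    path = _PATH.search(flat)
    if path is None:
        return None
    via = _VIA.search(flat)
    return MissingStoreFile(server=via.group(1) if via else None, **path.groupdict())


def archive_path(f: MissingStoreFile) -> str:
    """Assumed archive location of the same file (layout is an assumption)."""
    return (f"{f.root}/archive/data/{f.namespace}/{f.table}/"
            f"{f.region}/{f.family}/{f.hfile}")
```

Feeding it the "Caused by" lines from the trace should yield the server `ip-10-0-3-13.ec2.internal,16020,1508372563772` and region `ecbb3aeaf7c5b1f65742deab5812362c`. Whether the file actually sits in the archive directory would still have to be checked on HDFS by hand.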

[jira] [Commented] (HBASE-19681) Online snapshot creation failing with missing store file

2018-03-23 Thread Saad Mufti (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1641#comment-1641
 ] 

Saad Mufti commented on HBASE-19681:


We are facing the exact same situation with HBase 1.4.0 on AWS EMR. 
Does anyone have a potential recovery process? We haven't tried a restart, but 
we migrated the region using the "assign" command in the shell; the region 
moved, but the problem persists. We have also seen the exception in both the 
snapshot thread and the compaction thread.


[jira] [Commented] (HBASE-19681) Online snapshot creation failing with missing store file

2018-01-05 Thread Anirban Roy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16314095#comment-16314095
 ] 

Anirban Roy commented on HBASE-19681:
-

Could it be related to 
[HBASE-16754|https://issues.apache.org/jira/browse/HBASE-16754]? The stack 
traces look identical.


[jira] [Commented] (HBASE-19681) Online snapshot creation failing with missing store file

2018-01-05 Thread Anirban Roy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313972#comment-16313972
 ] 

Anirban Roy commented on HBASE-19681:
-

We also see the following exception in the region server during compaction:

2018-01-05 13:31:55,910 ERROR [regionserver/ip-10-0-1-237.ec2.internal/10.0.1.237:16020-longCompactions-1508372592608] regionserver.CompactSplitThread: Compaction selection failed Store = d, pri = 5
java.io.FileNotFoundException: File does not exist: hdfs://ip-10-0-12-164.ec2.internal:8020/user/hbase/data/default/x_table/396a31774fbb8b8ed1020850e6035973/d/4a46f33587ae43d2986cbf0e45379c83
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:431)
        at org.apache.hadoop.hbase.regionserver.StoreFileInfo.getReferencedFileStatus(StoreFileInfo.java:342)
        at org.apache.hadoop.hbase.regionserver.StoreFileInfo.getFileStatus(StoreFileInfo.java:355)
        at org.apache.hadoop.hbase.regionserver.StoreFileInfo.getModificationTime(StoreFileInfo.java:360)
        at org.apache.hadoop.hbase.regionserver.StoreFile.getModificationTimeStamp(StoreFile.java:321)
        at org.apache.hadoop.hbase.regionserver.StoreUtils.getLowestTimestamp(StoreUtils.java:63)
        at org.apache.hadoop.hbase.regionserver.compactions.RatioBasedCompactionPolicy.shouldPerformMajorCompaction(RatioBasedCompactionPolicy.java:64)
        at org.apache.hadoop.hbase.regionserver.compactions.SortedCompactionPolicy.selectCompaction(SortedCompactionPolicy.java:82)
        at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.select(DefaultStoreEngine.java:107)
        at org.apache.hadoop.hbase.regionserver.HStore.requestCompaction(HStore.java:1661)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread.selectCompaction(CompactSplitThread.java:369)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread.access$100(CompactSplitThread.java:59)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.doCompaction(CompactSplitThread.java:494)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:564)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
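Since the same "File does not exist" error shows up in both the snapshot and compaction paths, it can help to list every store file a region server log reports as missing, grouped by region. The following is a minimal sketch of that idea, not an HBase tool; the function name is hypothetical and it assumes the raw (unquoted) log text has been read into a string.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches the message seen in the trace above; "\s+" tolerates the path
# being wrapped onto the following line.
_FNFE = re.compile(
    r"FileNotFoundException: File does not exist:\s+"
    r"(\S+/data/[^/\s]+/[^/\s]+/([0-9a-f]{32})/[^/\s]+/([0-9a-f]{32}))"
)


def missing_files_by_region(log_text: str) -> Dict[str, List[str]]:
    """Map encoded region name -> missing store file names, in first-seen order."""
    found: Dict[str, List[str]] = defaultdict(list)
    for match in _FNFE.finditer(log_text):
        _path, region, hfile = match.groups()
        if hfile not in found[region]:  # report each file once per region
            found[region].append(hfile)
    return dict(found)
```

For the compaction failure above, this would report file `4a46f33587ae43d2986cbf0e45379c83` under region `396a31774fbb8b8ed1020850e6035973`.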



[jira] [Commented] (HBASE-19681) Online snapshot creation failing with missing store file

2018-01-02 Thread Anirban Roy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309160#comment-16309160
 ] 

Anirban Roy commented on HBASE-19681:
-

I attached the region server log snippet where I found the references to the 
missing HFile. Note the timestamps of the three log statements that deal with 
the file. Apart from the last ERROR-level log, I did not find any other 
ERROR/WARN-level statements in the log for the region. Do you have any clue 
what might have gone wrong? If the missing file had been subsumed by a 
subsequent file during a minor compaction, wouldn't there be a mention of it in 
the log?

I can't move to 1.4.0 now (considerable effort), but I may consider it once I 
know what the real issue is.
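One way to line up the log statements that deal with a given file is to collect every log line mentioning the HFile name along with the timestamp of the entry it belongs to. The sketch below assumes the timestamp format seen in the compaction error earlier in this thread (`2018-01-05 13:31:55,910 ERROR ...`); the function name is hypothetical.

```python
import re
from datetime import datetime
from typing import Iterable, List, Tuple

# Leading "<date> <time>,<millis> <LEVEL>" of a log entry.
_TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\w+)")


def hfile_timeline(log_lines: Iterable[str],
                   hfile_id: str) -> List[Tuple[datetime, str, str]]:
    """Return (timestamp, level, line) for each line that mentions hfile_id.

    Continuation lines (for example exception text) carry no timestamp of
    their own, so they inherit the one from the preceding entry.
    """
    events = []
    current = None  # (timestamp, level) of the most recent timestamped line
    for line in log_lines:
        m = _TS.match(line)
        if m:
            stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
            current = (stamp, m.group(2))
        if hfile_id in line and current is not None:
            events.append((current[0], current[1], line.strip()))
    return sorted(events, key=lambda e: e[0])
```

A gap in the timeline, such as no compaction entry that consumes the file before the first error, is the kind of thing the question above is asking about.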


[jira] [Commented] (HBASE-19681) Online snapshot creation failing with missing store file

2018-01-01 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307526#comment-16307526
 ] 

Ted Yu commented on HBASE-19681:


Can you upload the region server log so that we know more about 
f76d8827c29244b99bf9344982956523?

If possible, please upgrade to 1.4.0, which has related fixes such as:

HBASE-19468 FNFE during scans and flushes


[jira] [Commented] (HBASE-19681) Online snapshot creation failing with missing store file

2018-01-01 Thread Anirban Roy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307522#comment-16307522
 ] 

Anirban Roy commented on HBASE-19681:
-

We also want to know whether there is any potential data loss due to this 
error. Looking at the region server log, we see a reference to the HFile and a 
few other HFiles being compacted into it, but no record of this particular 
HFile being compacted into a newer HFile. Yet when we check HDFS, the file is 
really missing. Once the region server is restarted, it no longer complains 
about the missing HFile. Hence, it is very important to understand this 
behavior and its impact before we get a fix.
