The latest log is very interesting, Tianying, but I don't see how it is related to the initial CorruptedSnapshotException, since ZKProcedureMemberRpcs is aborting the operation due to a timeout. Would it be possible for you to upload the HBase master and region server logs to pastebin or some other site? Just the window around the time you started the snapshot and it failed should be fine.
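As an aside while comparing logs (an editor's sketch, decoding only numbers already present in the quoted trace below): the timer's Start/End values in the TimeoutException are epoch milliseconds, and they line up exactly with the 20:35:46 subprocedure launch and the ~35 s budget, so the abort really is the timer firing on schedule rather than some other failure:

```python
# Sanity check: decode the epoch-millisecond Start/End values from the
# TimeoutException in the quoted trace and confirm they match the log's
# wall-clock timestamps and the stated 35000 ms budget.
from datetime import datetime, timezone

start_ms, end_ms = 1432067746683, 1432067781681  # values copied from the stack trace

start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
print(start.strftime("%Y-%m-%d %H:%M:%S"))  # 2015-05-19 20:35:46 (UTC), the subprocedure launch time
print(end_ms - start_ms)                    # 34998 ms elapsed, just under the max:35000 budget
```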
thanks, esteban. -- Cloudera, Inc. On Tue, May 19, 2015 at 1:45 PM, Tianying Chang <[email protected]> wrote: > Matteo > > > By looking at the DEBUG log at RS side, it seems to me that no online > regions were picked up, i.e. that this call returned 0 regions. > But I am not sure how that happens. Is there any way to verify this? > > involvedRegions = getRegionsToSnapshot(snapshot); > > 2015-05-19 20:36:22,223 DEBUG > org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager: > cancelling 0 tasks for snapshot > > > > Thanks > Tian-Ying > > Full log: > > 2015-05-19 20:35:46,684 INFO > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure > start children changed event: /hbase/online-snapshot/acquired > 2015-05-19 20:35:46,686 DEBUG > org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager: > Launching subprocedure for snapshot ss_rich_pin_data_v1 from table > rich_pin_data_v1 > 2015-05-19 20:36:21,723 INFO > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received created > event:/hbase/online-snapshot/abort/ss_rich_pin_data_v1 > 2015-05-19 20:36:21,724 ERROR > org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign > exception to subprocedure ss_rich_pin_data_v1 > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via > timer-java.util.Timer@5add830c > :org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: > org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! 
> Source:Timeout caused Foreign Exception Start:1432067746683, > End:1432067781681, diff:34998, max:35000 ms > at > > org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171) > at > > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320) > at > > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:95) > at > > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:290) > at > > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495) > Caused by: > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: > org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! > Source:Timeout caused Foreign Exception Start:1432067746683, > End:1432067781681, diff:34998, max:35000 ms > at > > org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:71) > at java.util.TimerThread.mainLoop(Timer.java:512) > at java.util.TimerThread.run(Timer.java:462) > 2015-05-19 20:36:21,724 INFO > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure > abort children changed event: /hbase/online-snapshot/abort > 2015-05-19 20:36:21,726 ERROR > org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign > exception to subprocedure ss_rich_pin_data_v1 > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via > timer-java.util.Timer@5add830c > :org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: > org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! 
> Source:Timeout caused Foreign Exception Start:1432067746683, > End:1432067781681, diff:34998, max:35000 ms > at > > org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171) > at > > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320) > at > > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.watchForAbortedProcedures(ZKProcedureMemberRpcs.java:143) > at > > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$200(ZKProcedureMemberRpcs.java:56) > at > > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:111) > at > > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:311) > at > > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495) > Caused by: > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: > org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! > Source:Timeout caused Foreign Exception Start:1432067746683, > End:1432067781681, diff:34998, max:35000 ms > at > > org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:71) > at java.util.TimerThread.mainLoop(Timer.java:512) > at java.util.TimerThread.run(Timer.java:462) > 2015-05-19 20:36:21,780 INFO > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure > start children changed event: /hbase/online-snapshot/acquired > 2015-05-19 20:36:21,784 INFO > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure > abort children changed event: /hbase/online-snapshot/abort > 2015-05-19 20:36:22,221 ERROR > org.apache.hadoop.hbase.procedure.Subprocedure: Subprocedure > 'ss_rich_pin_data_v1' aborting due to a ForeignException! 
> org.apache.hadoop.hbase.errorhandling.TimeoutException via > timer-java.util.Timer@25ff6d2a > :org.apache.hadoop.hbase.errorhandling.TimeoutException: > Timeout elapsed! Source:Timeout caused Foreign Exception > Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms > at > > org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85) > at > > org.apache.hadoop.hbase.procedure.Procedure.waitForLatch(Procedure.java:369) > at > > org.apache.hadoop.hbase.procedure.Subprocedure.waitForReachedGlobalBarrier(Subprocedure.java:296) > at > org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:170) > at > org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) > at > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) > at java.lang.Thread.run(Thread.java:662) > Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout > elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, > End:1432067781721, diff:35034, max:35000 ms > at > > org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68) > at java.util.TimerThread.mainLoop(Timer.java:512) > at java.util.TimerThread.run(Timer.java:462) > 2015-05-19 20:36:22,223 ERROR > org.apache.hadoop.hbase.procedure.Subprocedure: Subprocedure > 'ss_rich_pin_data_v1' aborting due to a ForeignException! > org.apache.hadoop.hbase.errorhandling.TimeoutException via > timer-java.util.Timer@25ff6d2a > :org.apache.hadoop.hbase.errorhandling.TimeoutException: > Timeout elapsed! 
Source:Timeout caused Foreign Exception > Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms > at > > org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85) > at > > org.apache.hadoop.hbase.procedure.Procedure.waitForLatch(Procedure.java:369) > at > > org.apache.hadoop.hbase.procedure.Subprocedure.waitForReachedGlobalBarrier(Subprocedure.java:296) > at > org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:170) > at > org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) > at > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) > at java.lang.Thread.run(Thread.java:662) > Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout > elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, > End:1432067781721, diff:35034, max:35000 ms > at > > org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68) > at java.util.TimerThread.mainLoop(Timer.java:512) > at java.util.TimerThread.run(Timer.java:462) > 2015-05-19 20:36:22,223 INFO > org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure: > Aborting all online FLUSH snapshot subprocedure task threads for > 'ss_rich_pin_data_v1' due to error > org.apache.hadoop.hbase.errorhandling.TimeoutException via > timer-java.util.Timer@25ff6d2a > :org.apache.hadoop.hbase.errorhandling.TimeoutException: > Timeout elapsed! 
Source:Timeout caused Foreign Exception > Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms > at > > org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85) > at > > org.apache.hadoop.hbase.procedure.Procedure.waitForLatch(Procedure.java:369) > at > > org.apache.hadoop.hbase.procedure.Subprocedure.waitForReachedGlobalBarrier(Subprocedure.java:296) > at > org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:170) > at > org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) > at > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) > at java.lang.Thread.run(Thread.java:662) > Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout > elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, > End:1432067781721, diff:35034, max:35000 ms > at > > org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68) > at java.util.TimerThread.mainLoop(Timer.java:512) > at java.util.TimerThread.run(Timer.java:462) > 2015-05-19 20:36:22,223 DEBUG > org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager: > cancelling 0 tasks for snapshot > > On Tue, May 19, 2015 at 1:30 PM, Tianying Chang <[email protected]> wrote: > > > Matteo > > > > We are using hdfs2.0 + HBase94.7. > > > > I saw this ArrayIndexOutOfBoundsException: 2 error also. What does that > > mean? > > > > BTW, other tables (but those are smaller in terms of region count) in > this > > same cluster is able to create snapshot, only this table is failing. 
> > > > Thanks > > Tian-Ying > > > > On Tue, May 19, 2015 at 11:50 AM, Matteo Bertozzi < > [email protected] > > > wrote: > > > >> can you debug the protobuf problem, I think we abort because we are not > >> able to write > >> > >> 2015-05-19 06:00:49,745 WARN org.apache.hadoop.ipc.HBaseServer: IPC > Server > >> handler 50 on 60000 caught: java.lang.ArrayIndexOutOfBoundsException: 2 > >> at java.util.Arrays$ArrayList.get(Arrays.java:3381) > >> at > >> java.util.Collections$UnmodifiableList.get(Collections.java:1152) > >> at > >> > >> > org.apache.hadoop.hbase.protobuf.generated.HBaseProtos$SnapshotDescription$Type.getValueDescriptor(HBaseProtos.java:99) > >> ... > >> com.google.protobuf.AbstractMessage.toString(AbstractMessage.java:86) > >> at > >> > >> > org.apache.hadoop.hbase.snapshot.HSnapshotDescription.toString(HSnapshotDescription.java:72) > >> at java.lang.String.valueOf(String.java:2826) > >> at java.lang.StringBuilder.append(StringBuilder.java:115) > >> at > >> org.apache.hadoop.hbase.ipc.Invocation.toString(Invocation.java:152) > >> at > >> > >> > org.apache.hadoop.hbase.ipc.HBaseServer$Call.toString(HBaseServer.java:304) > >> > >> Matteo > >> > >> > >> On Tue, May 19, 2015 at 11:35 AM, Tianying Chang <[email protected]> > >> wrote: > >> > >> > Actually, I find it does not even print out the debug info below for > >> this > >> > table, other table will print out this logging. So it seems it did not > >> > invoke the FlushSnapshotSubprocedure at all. > >> > > >> > > >> > @Override > >> > public Void call() throws Exception { > >> > // Taking the region read lock prevents the individual region > from > >> > being closed while a > >> > // snapshot is in progress. This is helpful but not sufficient > >> for > >> > preventing races with > >> > // snapshots that involve multiple regions and regionservers. 
> It > >> is > >> > still possible to have > >> > // an interleaving such that globally regions are missing, so we > >> > still need the verification > >> > // step. > >> > LOG.debug("Starting region operation on " + region); > >> > > >> > On Tue, May 19, 2015 at 11:26 AM, Tianying Chang <[email protected]> > >> > wrote: > >> > > >> > > Hi, Esteban, > >> > > > >> > > There is no region split in this cluster, since we put the region > size > >> > > upper bound to be really high to prevent splitting. > >> > > > >> > > I think it happens for all the regions of this table. > >> > > > >> > > I repeatedly run "hdfs dfs -lsr > >> > > /hbase/.hbase-snapshot/ss_rich_pin_data_v1" while taking snapshot, > no > >> > > region was able to write into this direction. I also turn on DEBUG > >> > logging > >> > > on RS, all RS just report fail with Timeout, with no specific > reason. > >> > > > >> > > Thanks > >> > > Tian-Ying > >> > > > >> > > On Tue, May 19, 2015 at 11:06 AM, Esteban Gutierrez < > >> > [email protected]> > >> > > wrote: > >> > > > >> > >> Hi Tianying, > >> > >> > >> > >> Is this happening consistently in this region or is it happening > >> > randomly > >> > >> across other regions too? One possibility is that there was a split > >> > going > >> > >> on at the time you started to take the snapshot and it failed. If > you > >> > look > >> > >> into /hbase/rich_pin_data_v1 can you find a directory named > >> > >> dff681880bb2b23d0351d6656a1dbbb9 in there? > >> > >> > >> > >> cheers, > >> > >> esteban. > >> > >> > >> > >> > >> > >> -- > >> > >> Cloudera, Inc. > >> > >> > >> > >> > >> > >> On Mon, May 18, 2015 at 11:12 PM, Tianying Chang < > [email protected]> > >> > >> wrote: > >> > >> > >> > >> > Hi, > >> > >> > > >> > >> > We have a cluster that used to be able to take snapshot. But > >> recently, > >> > >> one > >> > >> > table failed due to the error below. Other tables on the same > >> clusters > >> > >> are > >> > >> > fine. 
> >> > >> > > >> > >> > Any idea what could go wrong? Is the table not healthy? But I run > >> > hbase > >> > >> > hbck, it reports cluster healthy. > >> > >> > > >> > >> > BTW, we are running 94.7, so we need to take snapshot of the data > >> to > >> > >> export > >> > >> > to a new cluster of 94.26 as part of upgrade (and eventually > >> upgrade > >> > to > >> > >> > 1.x) > >> > >> > > >> > >> > Thanks > >> > >> > Tian-Ying > >> > >> > > >> > >> > > >> > >> > 015-05-19 06:00:45,505 ERROR > >> > >> > org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler: > Failed > >> > >> taking > >> > >> > snapshot { ss=ss_rich_pin_data_v1 table=rich_pin_data_v1 > >> > type=SKIPFLUSH > >> > >> } > >> > >> > due to exception:No region directory found for region:{NAME => > >> > >> > > >> 'rich_pin_data_v1,,1389319134976.dff681880bb2b23d0351d6656a1dbbb9.', > >> > >> > STARTKEY => '', ENDKEY => '001ff3a165ff571471603035ca7b4be9', > >> ENCODED > >> > => > >> > >> > dff681880bb2b23d0351d6656a1dbbb9,} > >> > >> > org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: No > >> region > >> > >> > directory found for region:{NAME => > >> > >> > > >> 'rich_pin_data_v1,,1389319134976.dff681880bb2b23d0351d6656a1dbbb9.', > >> > >> > STARTKEY => '', ENDKEY => '001ff3a165ff571471603035ca7b4be9', > >> ENCODED > >> > => > >> > >> > dff681880bb2b23d0351d6656a1dbbb9,} > >> > >> > at > >> > >> > > >> > >> > > >> > >> > >> > > >> > org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifyRegion(MasterSnapshotVerifier.java:167) > >> > >> > at > >> > >> > > >> > >> > > >> > >> > >> > > >> > org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifyRegions(MasterSnapshotVerifier.java:152) > >> > >> > at > >> > >> > > >> > >> > > >> > >> > >> > > >> > org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifySnapshot(MasterSnapshotVerifier.java:115) > >> > >> > at > >> > >> > > >> > >> > > >> > >> > >> > > >> > 
org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.process(TakeSnapshotHandler.java:156) > >> > >> > at > >> > >> > > >> > > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175) > >> > >> > at > >> > >> > > >> > >> > > >> > >> > >> > > >> > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > >> > >> > at > >> > >> > > >> > >> > > >> > >> > >> > > >> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > >> > >> > at java.lang.Thread.run(Thread.java:662) > >> > >> > 2015-05-19 06:00:45,505 INFO > >> > >> > org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler: Stop > >> > taking > >> > >> > snapshot={ ss=ss_rich_pin_data_v1 table=rich_pin_data_v1 > >> > type=SKIPFLUSH > >> > >> } > >> > >> > because: Failed to take snapshot '{ ss=ss_rich_pin_data_v1 > >> > >> > table=rich_pin_data_v1 type=SKIPFLUSH }' due to exception > >> > >> > 2015-05-19 06:00:49,745 WARN org.apache.hadoop.ipc.HBaseServer: > IPC > >> > >> Server > >> > >> > handler 50 on 60000 caught: > >> java.lang.ArrayIndexOutOfBoundsException: > >> > 2 > >> > >> > at java.util.Arrays$ArrayList.get(Arrays.java:3381) > >> > >> > at > >> > >> > java.util.Collections$UnmodifiableList.get(Collections.java:1152) > >> > >> > at > >> > >> > > >> > >> > > >> > >> > >> > > >> > org.apache.hadoop.hbase.protobuf.generated.HBaseProtos$SnapshotDescription$Type.getValueDescriptor(HBaseProtos.java:99) > >> > >> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native > >> Method) > >> > >> > at > >> > >> > > >> > >> > > >> > >> > >> > > >> > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > >> > >> > at > >> > >> > > >> > >> > > >> > >> > >> > > >> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > >> > >> > at java.lang.reflect.Method.invoke(Method.java:597) > >> > >> > at > >> > >> > > >> > >> > > >> > >> > >> > > >> > 
com.google.protobuf.GeneratedMessage.invokeOrDie(GeneratedMessage.java:1369) > >> > >> > at > >> > >> > > >> > >> > >> > > >> > com.google.protobuf.GeneratedMessage.access$1400(GeneratedMessage.java:57) > >> > >> > at > >> > >> > > >> > >> > > >> > >> > >> > > >> > com.google.protobuf.GeneratedMessage$FieldAccessorTable$SingularEnumFieldAccessor.get(GeneratedMessage.java:1670) > >> > >> > at > >> > >> > > >> > > com.google.protobuf.GeneratedMessage.getField(GeneratedMessage.java:162) > >> > >> > at > >> > >> > > >> > >> > > >> > >> > >> > > >> > com.google.protobuf.GeneratedMessage.getAllFieldsMutable(GeneratedMessage.java:113) > >> > >> > at > >> > >> > > >> > >> > > >> > >> > >> > > >> > com.google.protobuf.GeneratedMessage.getAllFields(GeneratedMessage.java:152) > >> > >> > at > >> > >> > com.google.protobuf.TextFormat$Printer.print(TextFormat.java:228) > >> > >> > at > >> > >> > > >> com.google.protobuf.TextFormat$Printer.access$200(TextFormat.java:217) > >> > >> > at > com.google.protobuf.TextFormat.print(TextFormat.java:68) > >> > >> > at > >> > >> > com.google.protobuf.TextFormat.printToString(TextFormat.java:115) > >> > >> > at > >> > >> > > >> com.google.protobuf.AbstractMessage.toString(AbstractMessage.java:86) > >> > >> > at > >> > >> > > >> > >> > > >> > >> > >> > > >> > org.apache.hadoop.hbase.snapshot.HSnapshotDescription.toString(HSnapshotDescription.java:72) > >> > >> > at java.lang.String.valueOf(String.java:2826) > >> > >> > at java.lang.StringBuilder.append(StringBuilder.java:115) > >> > >> > at > >> > >> > > >> org.apache.hadoop.hbase.ipc.Invocation.toString(Invocation.java:152) > >> > >> > at > >> > >> > > >> > >> > >> > > >> > org.apache.hadoop.hbase.ipc.HBaseServer$Call.toString(HBaseServer.java:304) > >> > >> > at java.lang.String.valueOf(String.java:2826) > >> > >> > at java.lang.StringBuilder.append(StringBuilder.java:115) > >> > >> > > >> > >> > >> > > > >> > > > >> > > >> > > > > >
