Matteo
By looking at the DEBUG log at RS side, it seems to me that no online regions were pickedup. So it seems to me that this call returns 0 regions. But I am not sure how that happens. Is there anyway to verify this? involvedRegions = getRegionsToSnapshot(snapshot); 2015-05-19 20:36:22,223 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager: cancelling 0 tasks for snapshot Thanks Tian-Ying Full log: 2015-05-19 20:35:46,684 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/online-snapshot/acquired 2015-05-19 20:35:46,686 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot ss_rich_pin_data_v1 from table rich_pin_data_v1 2015-05-19 20:36:21,723 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received created event:/hbase/online-snapshot/abort/ss_rich_pin_data_v1 2015-05-19 20:36:21,724 ERROR org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign exception to subprocedure ss_rich_pin_data_v1 org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@5add830c:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746683, End:1432067781681, diff:34998, max:35000 ms at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171) at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320) at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:95) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:290) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495) Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746683, End:1432067781681, diff:34998, max:35000 ms at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:71) at java.util.TimerThread.mainLoop(Timer.java:512) at java.util.TimerThread.run(Timer.java:462) 2015-05-19 20:36:21,724 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/online-snapshot/abort 2015-05-19 20:36:21,726 ERROR org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign exception to subprocedure ss_rich_pin_data_v1 org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@5add830c:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746683, End:1432067781681, diff:34998, max:35000 ms at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171) at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320) at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.watchForAbortedProcedures(ZKProcedureMemberRpcs.java:143) at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$200(ZKProcedureMemberRpcs.java:56) at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:111) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:311) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495) Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746683, End:1432067781681, diff:34998, max:35000 ms at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:71) at java.util.TimerThread.mainLoop(Timer.java:512) at java.util.TimerThread.run(Timer.java:462) 2015-05-19 20:36:21,780 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/online-snapshot/acquired 2015-05-19 20:36:21,784 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/online-snapshot/abort 2015-05-19 20:36:22,221 ERROR org.apache.hadoop.hbase.procedure.Subprocedure: Subprocedure 'ss_rich_pin_data_v1' aborting due to a ForeignException! org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@25ff6d2a:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85) at org.apache.hadoop.hbase.procedure.Procedure.waitForLatch(Procedure.java:369) at org.apache.hadoop.hbase.procedure.Subprocedure.waitForReachedGlobalBarrier(Subprocedure.java:296) at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:170) at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68) at java.util.TimerThread.mainLoop(Timer.java:512) at java.util.TimerThread.run(Timer.java:462) 2015-05-19 20:36:22,223 ERROR org.apache.hadoop.hbase.procedure.Subprocedure: Subprocedure 'ss_rich_pin_data_v1' aborting due to a ForeignException! org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@25ff6d2a:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85) at org.apache.hadoop.hbase.procedure.Procedure.waitForLatch(Procedure.java:369) at org.apache.hadoop.hbase.procedure.Subprocedure.waitForReachedGlobalBarrier(Subprocedure.java:296) at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:170) at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68) at java.util.TimerThread.mainLoop(Timer.java:512) at java.util.TimerThread.run(Timer.java:462) 2015-05-19 20:36:22,223 INFO org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure: Aborting all online FLUSH snapshot subprocedure task threads for 'ss_rich_pin_data_v1' due to error org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@25ff6d2a:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85) at org.apache.hadoop.hbase.procedure.Procedure.waitForLatch(Procedure.java:369) at org.apache.hadoop.hbase.procedure.Subprocedure.waitForReachedGlobalBarrier(Subprocedure.java:296) at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:170) at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68) at java.util.TimerThread.mainLoop(Timer.java:512) at java.util.TimerThread.run(Timer.java:462) 2015-05-19 20:36:22,223 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager: cancelling 0 tasks for snapshot On Tue, May 19, 2015 at 1:30 PM, Tianying Chang <[email protected]> wrote: > Matteo > > We are using hdfs2.0 + HBase94.7. > > I saw this ArrayIndexOutOfBoundsException: 2 error also. What does that > mean? > > BTW, other tables (but those are smaller in terms of region count) in this > same cluster is able to create snapshot, only this table is failing. > > Thanks > Tian-Ying > > On Tue, May 19, 2015 at 11:50 AM, Matteo Bertozzi <[email protected] > > wrote: > >> can you debug the protobuf problem, I think we abort because we are not >> able to write >> >> 2015-05-19 06:00:49,745 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server >> handler 50 on 60000 caught: java.lang.ArrayIndexOutOfBoundsException: 2 >> at java.util.Arrays$ArrayList.get(Arrays.java:3381) >> at >> java.util.Collections$UnmodifiableList.get(Collections.java:1152) >> at >> >> org.apache.hadoop.hbase.protobuf.generated.HBaseProtos$SnapshotDescription$Type.getValueDescriptor(HBaseProtos.java:99) >> ... >> com.google.protobuf.AbstractMessage.toString(AbstractMessage.java:86) >> at >> >> org.apache.hadoop.hbase.snapshot.HSnapshotDescription.toString(HSnapshotDescription.java:72) >> at java.lang.String.valueOf(String.java:2826) >> at java.lang.StringBuilder.append(StringBuilder.java:115) >> at >> org.apache.hadoop.hbase.ipc.Invocation.toString(Invocation.java:152) >> at >> >> org.apache.hadoop.hbase.ipc.HBaseServer$Call.toString(HBaseServer.java:304) >> >> Matteo >> >> >> On Tue, May 19, 2015 at 11:35 AM, Tianying Chang <[email protected]> >> wrote: >> >> > Actually, I find it does not even print out the debug info below for >> this >> > table, other table will print out this logging. So it seems it did not >> > invoke the FlushSnapshotSubprocedure at all. >> > >> > >> > @Override >> > public Void call() throws Exception { >> > // Taking the region read lock prevents the individual region from >> > being closed while a >> > // snapshot is in progress. This is helpful but not sufficient >> for >> > preventing races with >> > // snapshots that involve multiple regions and regionservers. It >> is >> > still possible to have >> > // an interleaving such that globally regions are missing, so we >> > still need the verification >> > // step. >> > LOG.debug("Starting region operation on " + region); >> > >> > On Tue, May 19, 2015 at 11:26 AM, Tianying Chang <[email protected]> >> > wrote: >> > >> > > Hi, Esteban, >> > > >> > > There is no region split in this cluster, since we put the region size >> > > upper bound to be really high to prevent splitting. >> > > >> > > I think it happens for all the regions of this table. >> > > >> > > I repeatedly run "hdfs dfs -lsr >> > > /hbase/.hbase-snapshot/ss_rich_pin_data_v1" while taking snapshot, no >> > > region was able to write into this direction. I also turn on DEBUG >> > logging >> > > on RS, all RS just report fail with Timeout, with no specific reason. >> > > >> > > Thanks >> > > Tian-Ying >> > > >> > > On Tue, May 19, 2015 at 11:06 AM, Esteban Gutierrez < >> > [email protected]> >> > > wrote: >> > > >> > >> Hi Tianying, >> > >> >> > >> Is this happening consistently in this region or is it happening >> > randomly >> > >> across other regions too? One possibility is that there was a split >> > going >> > >> on at the time you started to take the snapshot and it failed. If you >> > look >> > >> into /hbase/rich_pin_data_v1 can you find a directory named >> > >> dff681880bb2b23d0351d6656a1dbbb9 in there? >> > >> >> > >> cheers, >> > >> esteban. >> > >> >> > >> >> > >> -- >> > >> Cloudera, Inc. >> > >> >> > >> >> > >> On Mon, May 18, 2015 at 11:12 PM, Tianying Chang <[email protected]> >> > >> wrote: >> > >> >> > >> > Hi, >> > >> > >> > >> > We have a cluster that used to be able to take snapshot. But >> recently, >> > >> one >> > >> > table failed due to the error below. Other tables on the same >> clusters >> > >> are >> > >> > fine. >> > >> > >> > >> > Any idea what could go wrong? Is the table not healthy? But I run >> > hbase >> > >> > hbck, it reports cluster healthy. >> > >> > >> > >> > BTW, we are running 94.7, so we need to take snapshot of the data >> to >> > >> export >> > >> > to a new cluster of 94.26 as part of upgrade (and eventually >> upgrade >> > to >> > >> > 1.x) >> > >> > >> > >> > Thanks >> > >> > Tian-Ying >> > >> > >> > >> > >> > >> > 015-05-19 06:00:45,505 ERROR >> > >> > org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler: Failed >> > >> taking >> > >> > snapshot { ss=ss_rich_pin_data_v1 table=rich_pin_data_v1 >> > type=SKIPFLUSH >> > >> } >> > >> > due to exception:No region directory found for region:{NAME => >> > >> > >> 'rich_pin_data_v1,,1389319134976.dff681880bb2b23d0351d6656a1dbbb9.', >> > >> > STARTKEY => '', ENDKEY => '001ff3a165ff571471603035ca7b4be9', >> ENCODED >> > => >> > >> > dff681880bb2b23d0351d6656a1dbbb9,} >> > >> > org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: No >> region >> > >> > directory found for region:{NAME => >> > >> > >> 'rich_pin_data_v1,,1389319134976.dff681880bb2b23d0351d6656a1dbbb9.', >> > >> > STARTKEY => '', ENDKEY => '001ff3a165ff571471603035ca7b4be9', >> ENCODED >> > => >> > >> > dff681880bb2b23d0351d6656a1dbbb9,} >> > >> > at >> > >> > >> > >> > >> > >> >> > >> org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifyRegion(MasterSnapshotVerifier.java:167) >> > >> > at >> > >> > >> > >> > >> > >> >> > >> org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifyRegions(MasterSnapshotVerifier.java:152) >> > >> > at >> > >> > >> > >> > >> > >> >> > >> org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifySnapshot(MasterSnapshotVerifier.java:115) >> > >> > at >> > >> > >> > >> > >> > >> >> > >> org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.process(TakeSnapshotHandler.java:156) >> > >> > at >> > >> > >> > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175) >> > >> > at >> > >> > >> > >> > >> > >> >> > >> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) >> > >> > at >> > >> > >> > >> > >> > >> >> > >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) >> > >> > at java.lang.Thread.run(Thread.java:662) >> > >> > 2015-05-19 06:00:45,505 INFO >> > >> > org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler: Stop >> > taking >> > >> > snapshot={ ss=ss_rich_pin_data_v1 table=rich_pin_data_v1 >> > type=SKIPFLUSH >> > >> } >> > >> > because: Failed to take snapshot '{ ss=ss_rich_pin_data_v1 >> > >> > table=rich_pin_data_v1 type=SKIPFLUSH }' due to exception >> > >> > 2015-05-19 06:00:49,745 WARN org.apache.hadoop.ipc.HBaseServer: IPC >> > >> Server >> > >> > handler 50 on 60000 caught: >> java.lang.ArrayIndexOutOfBoundsException: >> > 2 >> > >> > at java.util.Arrays$ArrayList.get(Arrays.java:3381) >> > >> > at >> > >> > java.util.Collections$UnmodifiableList.get(Collections.java:1152) >> > >> > at >> > >> > >> > >> > >> > >> >> > >> org.apache.hadoop.hbase.protobuf.generated.HBaseProtos$SnapshotDescription$Type.getValueDescriptor(HBaseProtos.java:99) >> > >> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native >> Method) >> > >> > at >> > >> > >> > >> > >> > >> >> > >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) >> > >> > at >> > >> > >> > >> > >> > >> >> > >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >> > >> > at java.lang.reflect.Method.invoke(Method.java:597) >> > >> > at >> > >> > >> > >> > >> > >> >> > >> com.google.protobuf.GeneratedMessage.invokeOrDie(GeneratedMessage.java:1369) >> > >> > at >> > >> > >> > >> >> > >> com.google.protobuf.GeneratedMessage.access$1400(GeneratedMessage.java:57) >> > >> > at >> > >> > >> > >> > >> > >> >> > >> com.google.protobuf.GeneratedMessage$FieldAccessorTable$SingularEnumFieldAccessor.get(GeneratedMessage.java:1670) >> > >> > at >> > >> > >> > com.google.protobuf.GeneratedMessage.getField(GeneratedMessage.java:162) >> > >> > at >> > >> > >> > >> > >> > >> >> > >> com.google.protobuf.GeneratedMessage.getAllFieldsMutable(GeneratedMessage.java:113) >> > >> > at >> > >> > >> > >> > >> > >> >> > >> com.google.protobuf.GeneratedMessage.getAllFields(GeneratedMessage.java:152) >> > >> > at >> > >> > com.google.protobuf.TextFormat$Printer.print(TextFormat.java:228) >> > >> > at >> > >> > >> com.google.protobuf.TextFormat$Printer.access$200(TextFormat.java:217) >> > >> > at com.google.protobuf.TextFormat.print(TextFormat.java:68) >> > >> > at >> > >> > com.google.protobuf.TextFormat.printToString(TextFormat.java:115) >> > >> > at >> > >> > >> com.google.protobuf.AbstractMessage.toString(AbstractMessage.java:86) >> > >> > at >> > >> > >> > >> > >> > >> >> > >> org.apache.hadoop.hbase.snapshot.HSnapshotDescription.toString(HSnapshotDescription.java:72) >> > >> > at java.lang.String.valueOf(String.java:2826) >> > >> > at java.lang.StringBuilder.append(StringBuilder.java:115) >> > >> > at >> > >> > >> org.apache.hadoop.hbase.ipc.Invocation.toString(Invocation.java:152) >> > >> > at >> > >> > >> > >> >> > >> org.apache.hadoop.hbase.ipc.HBaseServer$Call.toString(HBaseServer.java:304) >> > >> > at java.lang.String.valueOf(String.java:2826) >> > >> > at java.lang.StringBuilder.append(StringBuilder.java:115) >> > >> > >> > >> >> > > >> > > >> > >> > >
