Here is a totally different error for an OpenTSDB cluster. Any idea why it could fail to post to the zk node?
2015-05-20 16:45:31,151 ERROR org.apache.hadoop.hbase.procedure.ProcedureMember: Failed to post zk node:/hbase/online-snapshot/reached/ss_tsdb/opentsdb-live-regionserver-46487291.ec2.pin220.com,60020,1430950243187 to join procedure barrier.
java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/online-snapshot/reached/ss_tsdb/opentsdb-live-regionserver-46487291.ec2.pin220.com,60020,1430950243187
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.sendMemberCompleted(ZKProcedureMemberRpcs.java:267)
        at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:185)
        at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/online-snapshot/reached/ss_tsdb/opentsdb-live-regionserver-46487291.ec2.pin220.com,60020,1430950243187
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:421)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:403)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:1110)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:1099)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:1083)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.sendMemberCompleted(ZKProcedureMemberRpcs.java:265)
        ... 7 more
2015-05-20 16:45:31,152 ERROR org.apache.hadoop.hbase.procedure.Subprocedure: Failed to post zk node:/hbase/online-snapshot/reached/ss_tsdb/opentsdb-live-regionserver-46487291.ec2.pin220.com,60020,1430950243187 to join procedure barrier.
java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/online-snapshot/reached/ss_tsdb/opentsdb-live-regionserver-46487291.ec2.pin220.com,60020,1430950243187
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.sendMemberCompleted(ZKProcedureMemberRpcs.java:267)
        at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:185)
        at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:662)

Thanks
Tian-Ying
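The NoNode error means the reached/ barrier parent for the snapshot was already gone when this member tried to post its completion znode, which is consistent with the master having aborted and cleaned up the procedure first. The barrier znodes can be watched from outside while a snapshot runs; a minimal sketch with the plain ZooKeeper client (the quorum string and the default /hbase parent znode are assumptions to adjust for your deployment):

    import java.util.List;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class SnapshotBarrierCheck {
        public static void main(String[] args) throws Exception {
            // Assumed quorum; replace with your hbase.zookeeper.quorum hosts.
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000,
                new Watcher() {
                    public void process(WatchedEvent event) { /* no-op */ }
                });
            try {
                // The three phases of the online-snapshot procedure barrier.
                for (String node : new String[] {"/hbase/online-snapshot/acquired",
                                                 "/hbase/online-snapshot/reached",
                                                 "/hbase/online-snapshot/abort"}) {
                    Stat stat = zk.exists(node, false);
                    if (stat == null) {
                        System.out.println(node + " does not exist");
                        continue;
                    }
                    List<String> children = zk.getChildren(node, false);
                    System.out.println(node + " -> " + children);
                }
            } finally {
                zk.close();
            }
        }
    }

If the per-snapshot child under reached/ never appears, or appears and is deleted before all members post, that points at the abort path rather than at ZooKeeper itself.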
On Wed, May 20, 2015 at 9:26 AM, Tianying Chang <[email protected]> wrote:

> Just want to report some updates back.
>
> I fixed the problem by performing a rolling restart. From the DEBUG
> logging on the RS, I can see the DEBUG information pasted below; it is
> doing SKIP_FLUSH to take the snapshot. Before the rolling restart, none of
> these lines got printed when taking a snapshot of this table (other tables
> do print this information). One thing to mention is that I have to use
> SKIP_FLUSH; otherwise the snapshot will still fail, probably because of
> the particular traffic to this cluster. I am not sure it is just heavy
> writes, since we have other clusters that take an even higher write load
> and do not have this problem. It seems to me that the RSs are in some
> weird bad state where they won't take a snapshot; I am not sure where the
> bug is, though.
>
> 2015-05-20 16:11:34,230 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure: take snapshot without flush memstore first
> 2015-05-20 16:11:34,230 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Storing region-info for snapshot.
> 2015-05-20 16:11:34,255 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Creating references for hfiles
> 2015-05-20 16:11:34,255 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Adding snapshot references for [hdfs://ec2-54-243-40-6.compute-1.amazonaws.com/hbase/rich_pin_data_v1/c578f8a0b71e55a14c652ba16aa54b45/d/a60f3b6680ab48728bff61be3d1b419c, hdfs://ec2-54-243-40-6.compute-1.amazonaws.com/hbase/rich_pin_data_v1/c578f8a0b71e55a14c652ba16aa54b45/d/771b8125162f4866a002693152704280, hdfs://ec2-54-243-40-6.compute-1.amazonaws.com/hbase/rich_pin_data_v1/c578f8a0b71e55a14c652ba16aa54b45/d/633f76def221463991611b0b9d007bbd, hdfs://ec2-54-243-40-6.compute-1.amazonaws.com/hbase/rich_pin_data_v1/c578f8a0b71e55a14c652ba16aa54b45/d/82b91072d82c4f0a97960dd9bd65b814, hdfs://ec2-54-243-40-6.compute-1.amazonaws.com/hbase/rich_pin_data_v1/c578f8a0b71e55a14c652ba16aa54b45/d/76dac71663ac40919d227f039307cd6d] hfiles
>
> Thanks
> Tian-Ying
>
> On Tue, May 19, 2015 at 2:26 PM, Esteban Gutierrez <[email protected]> wrote:
>
>> http://pastebin.com or http://gist.github.com work fine.
>>
>> thanks,
>> esteban.
>>
>> --
>> Cloudera, Inc.
>>
>> On Tue, May 19, 2015 at 2:23 PM, Tianying Chang <[email protected]> wrote:
>>
>>> Sure, Esteban. Where is a good place to upload the log?
>>>
>>> On Tue, May 19, 2015 at 2:01 PM, Esteban Gutierrez <[email protected]> wrote:
>>>
>>>> The latest log is very interesting, Tianying, but I don't see how it is
>>>> related to the initial CorruptedSnapshotException, since
>>>> ZKProcedureMemberRpcs is aborting the operation due to a timeout. Would
>>>> it be possible for you to upload the HBase master and region server
>>>> logs to pastebin or some other site? Just around the time you started
>>>> the snapshot and it failed should be fine.
>>>>
>>>> thanks,
>>>> esteban.
>>>>
>>>> --
>>>> Cloudera, Inc.
>>>>
>>>> On Tue, May 19, 2015 at 1:45 PM, Tianying Chang <[email protected]> wrote:
>>>>
>>>>> Matteo,
>>>>>
>>>>> By looking at the DEBUG log on the RS side, it seems to me that no
>>>>> online regions were picked up. So it seems to me that the call below
>>>>> returns 0 regions. But I am not sure how that happens. Is there any
>>>>> way to verify this?
>>>>>
>>>>>     involvedRegions = getRegionsToSnapshot(snapshot);
>>>>>
>>>>> 2015-05-19 20:36:22,223 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager: cancelling 0 tasks for snapshot
>>>>> Thanks
>>>>> Tian-Ying
>>>>>
>>>>> Full log:
>>>>>
>>>>> 2015-05-19 20:35:46,684 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/online-snapshot/acquired
>>>>> 2015-05-19 20:35:46,686 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot ss_rich_pin_data_v1 from table rich_pin_data_v1
>>>>> 2015-05-19 20:36:21,723 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received created event:/hbase/online-snapshot/abort/ss_rich_pin_data_v1
>>>>> 2015-05-19 20:36:21,724 ERROR org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign exception to subprocedure ss_rich_pin_data_v1
>>>>> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@5add830c:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746683, End:1432067781681, diff:34998, max:35000 ms
>>>>>         at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
>>>>>         at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
>>>>>         at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:95)
>>>>>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:290)
>>>>>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
>>>>>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>>>> Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed!
>>>>> Source:Timeout caused Foreign Exception Start:1432067746683, End:1432067781681, diff:34998, max:35000 ms
>>>>>         at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:71)
>>>>>         at java.util.TimerThread.mainLoop(Timer.java:512)
>>>>>         at java.util.TimerThread.run(Timer.java:462)
>>>>> 2015-05-19 20:36:21,724 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/online-snapshot/abort
>>>>> 2015-05-19 20:36:21,726 ERROR org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign exception to subprocedure ss_rich_pin_data_v1
>>>>> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@5add830c:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746683, End:1432067781681, diff:34998, max:35000 ms
>>>>>         at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
>>>>>         at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
>>>>>         at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.watchForAbortedProcedures(ZKProcedureMemberRpcs.java:143)
>>>>>         at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$200(ZKProcedureMemberRpcs.java:56)
>>>>>         at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:111)
>>>>>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:311)
>>>>>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
>>>>>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>>>> Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746683, End:1432067781681, diff:34998, max:35000 ms
>>>>>         at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:71)
>>>>>         at java.util.TimerThread.mainLoop(Timer.java:512)
>>>>>         at java.util.TimerThread.run(Timer.java:462)
>>>>> 2015-05-19 20:36:21,780 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/online-snapshot/acquired
>>>>> 2015-05-19 20:36:21,784 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/online-snapshot/abort
>>>>> 2015-05-19 20:36:22,221 ERROR org.apache.hadoop.hbase.procedure.Subprocedure: Subprocedure 'ss_rich_pin_data_v1' aborting due to a ForeignException!
>>>>> org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@25ff6d2a:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms
>>>>>         at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
>>>>>         at org.apache.hadoop.hbase.procedure.Procedure.waitForLatch(Procedure.java:369)
>>>>>         at org.apache.hadoop.hbase.procedure.Subprocedure.waitForReachedGlobalBarrier(Subprocedure.java:296)
>>>>>         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:170)
>>>>>         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
>>>>>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>>>>>         at java.lang.Thread.run(Thread.java:662)
>>>>> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms
>>>>>         at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
>>>>>         at java.util.TimerThread.mainLoop(Timer.java:512)
>>>>>         at java.util.TimerThread.run(Timer.java:462)
>>>>> 2015-05-19 20:36:22,223 ERROR org.apache.hadoop.hbase.procedure.Subprocedure: Subprocedure 'ss_rich_pin_data_v1' aborting due to a ForeignException!
>>>>> org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@25ff6d2a:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed!
>>>>> Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms
>>>>>         at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
>>>>>         at org.apache.hadoop.hbase.procedure.Procedure.waitForLatch(Procedure.java:369)
>>>>>         at org.apache.hadoop.hbase.procedure.Subprocedure.waitForReachedGlobalBarrier(Subprocedure.java:296)
>>>>>         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:170)
>>>>>         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
>>>>>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>>>>>         at java.lang.Thread.run(Thread.java:662)
>>>>> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms
>>>>>         at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
>>>>>         at java.util.TimerThread.mainLoop(Timer.java:512)
>>>>>         at java.util.TimerThread.run(Timer.java:462)
>>>>> 2015-05-19 20:36:22,223 INFO org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure: Aborting all online FLUSH snapshot subprocedure task threads for 'ss_rich_pin_data_v1' due to error
>>>>> org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@25ff6d2a:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms
>>>>>         at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
>>>>>         at org.apache.hadoop.hbase.procedure.Procedure.waitForLatch(Procedure.java:369)
>>>>>         at org.apache.hadoop.hbase.procedure.Subprocedure.waitForReachedGlobalBarrier(Subprocedure.java:296)
>>>>>         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:170)
>>>>>         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
>>>>>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>>>>>         at java.lang.Thread.run(Thread.java:662)
>>>>> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms
>>>>>         at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
>>>>>         at java.util.TimerThread.mainLoop(Timer.java:512)
>>>>>         at java.util.TimerThread.run(Timer.java:462)
>>>>> 2015-05-19 20:36:22,223 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager: cancelling 0 tasks for snapshot
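The repeated "Timeout elapsed! ... max:35000 ms" lines show the procedure barrier giving up after the configured ~35 s. If the RSs are merely slow (e.g. waiting out a long flush or reference creation) rather than wedged, raising the snapshot timeouts on the master and region servers can get the snapshot through. The relevant hbase-site.xml properties in the 0.94-era snapshot code are, to the best of my knowledge, the following; treat the exact names and defaults as assumptions to verify against your build:

    hbase.snapshot.master.timeoutMillis   (master-side wait for the snapshot, ms)
    hbase.snapshot.region.timeout         (RS-side subprocedure timeout, ms)

Both sides need to agree, since whichever barrier times out first aborts the whole procedure.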
>>>>> On Tue, May 19, 2015 at 1:30 PM, Tianying Chang <[email protected]> wrote:
>>>>>
>>>>>> Matteo,
>>>>>>
>>>>>> We are using HDFS 2.0 + HBase 0.94.7.
>>>>>>
>>>>>> I saw this ArrayIndexOutOfBoundsException: 2 error also. What does that mean?
>>>>>>
>>>>>> BTW, other tables in this same cluster (though those are smaller in terms of region count) are able to create snapshots; only this table is failing.
>>>>>>
>>>>>> Thanks
>>>>>> Tian-Ying
>>>>>>
>>>>>> On Tue, May 19, 2015 at 11:50 AM, Matteo Bertozzi <[email protected]> wrote:
>>>>>>
>>>>>>> Can you debug the protobuf problem? I think we abort because we are not able to write:
>>>>>>>
>>>>>>> 2015-05-19 06:00:49,745 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 50 on 60000 caught: java.lang.ArrayIndexOutOfBoundsException: 2
>>>>>>>         at java.util.Arrays$ArrayList.get(Arrays.java:3381)
>>>>>>>         at java.util.Collections$UnmodifiableList.get(Collections.java:1152)
>>>>>>>         at org.apache.hadoop.hbase.protobuf.generated.HBaseProtos$SnapshotDescription$Type.getValueDescriptor(HBaseProtos.java:99)
>>>>>>>         ...
>>>>>>>         at com.google.protobuf.AbstractMessage.toString(AbstractMessage.java:86)
>>>>>>>         at org.apache.hadoop.hbase.snapshot.HSnapshotDescription.toString(HSnapshotDescription.java:72)
>>>>>>>         at java.lang.String.valueOf(String.java:2826)
>>>>>>>         at java.lang.StringBuilder.append(StringBuilder.java:115)
>>>>>>>         at org.apache.hadoop.hbase.ipc.Invocation.toString(Invocation.java:152)
>>>>>>>         at org.apache.hadoop.hbase.ipc.HBaseServer$Call.toString(HBaseServer.java:304)
>>>>>>>
>>>>>>> Matteo
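As to what the AIOOBE likely means (an inference from the trace, not something confirmed in this thread): SnapshotDescription.Type carries DISABLED(0), FLUSH(1) and, in builds with the SKIP_FLUSH feature (added around HBASE-10935), SKIPFLUSH(2). Type.getValueDescriptor() indexes the enum's descriptor list by value, so if any side of the RPC still runs generated code that predates SKIPFLUSH, printing a description with type 2 walks off a two-element list, which is exactly the Arrays$ArrayList.get frame above. A self-contained sketch of the failing pattern (list contents hypothetical):

    import java.util.Arrays;
    import java.util.List;

    public class EnumDescriptorMismatch {
        public static void main(String[] args) {
            // Stand-in for an old generated SnapshotDescription$Type
            // descriptor list that predates SKIPFLUSH (hypothetical build).
            List<String> valueDescriptors = Arrays.asList("DISABLED", "FLUSH");
            int wireValue = 2; // SKIPFLUSH ordinal sent by a newer peer
            // Mirrors Type.getValueDescriptor(): index the descriptor list
            // by value; an unknown value walks off the end of the list.
            System.out.println(valueDescriptors.get(wireValue));
            // -> java.lang.ArrayIndexOutOfBoundsException: 2
            //        at java.util.Arrays$ArrayList.get(...)
        }
    }

If that reading is right, it would fit a mixed 0.94.7/0.94.26 environment where one side knows SKIPFLUSH and the other does not.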
>>>>>>> On Tue, May 19, 2015 at 11:35 AM, Tianying Chang <[email protected]> wrote:
>>>>>>>
>>>>>>>> Actually, I find it does not even print out the debug info below for this table; other tables do print this logging. So it seems it did not invoke the FlushSnapshotSubprocedure at all.
>>>>>>>>
>>>>>>>>     @Override
>>>>>>>>     public Void call() throws Exception {
>>>>>>>>       // Taking the region read lock prevents the individual region from being closed while a
>>>>>>>>       // snapshot is in progress. This is helpful but not sufficient for preventing races with
>>>>>>>>       // snapshots that involve multiple regions and regionservers. It is still possible to have
>>>>>>>>       // an interleaving such that globally regions are missing, so we still need the verification
>>>>>>>>       // step.
>>>>>>>>       LOG.debug("Starting region operation on " + region);
>>>>>>>>
>>>>>>>> On Tue, May 19, 2015 at 11:26 AM, Tianying Chang <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Esteban,
>>>>>>>>>
>>>>>>>>> There is no region split in this cluster, since we set the region size upper bound really high to prevent splitting.
>>>>>>>>>
>>>>>>>>> I think it happens for all the regions of this table.
>>>>>>>>>
>>>>>>>>> I repeatedly ran "hdfs dfs -lsr /hbase/.hbase-snapshot/ss_rich_pin_data_v1" while taking the snapshot; no region was able to write into this directory. I also turned on DEBUG logging on the RSs; they all just report failure with a timeout, with no specific reason.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Tian-Ying
>>>>>>>>>
>>>>>>>>> On Tue, May 19, 2015 at 11:06 AM, Esteban Gutierrez <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Tianying,
>>>>>>>>>>
>>>>>>>>>> Is this happening consistently in this region, or is it happening randomly across other regions too? One possibility is that there was a split going on at the time you started to take the snapshot and it failed. If you look into /hbase/rich_pin_data_v1, can you find a directory named dff681880bb2b23d0351d6656a1dbbb9 in there?
>>>>>>>>>>
>>>>>>>>>> cheers,
>>>>>>>>>> esteban.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Cloudera, Inc.
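That directory check can also be scripted; a minimal FileSystem client that lists the region dir the verifier complains about (path assumes the default /hbase root dir, and the Hadoop config is picked up from the classpath):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RegionDirCheck {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml/hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Region directory named in the CorruptedSnapshotException.
            Path regionDir = new Path(
                "/hbase/rich_pin_data_v1/dff681880bb2b23d0351d6656a1dbbb9");
            if (!fs.exists(regionDir)) {
                System.out.println("Missing: " + regionDir);
                return;
            }
            for (FileStatus status : fs.listStatus(regionDir)) {
                System.out.println(status.getPath());
            }
        }
    }

If the directory exists on HDFS but the verifier still reports it missing, the mismatch is in what the snapshot captured, not in the table layout.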
>>>>>>>>>> On Mon, May 18, 2015 at 11:12 PM, Tianying Chang <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> We have a cluster that used to be able to take snapshots, but recently one table started failing with the error below. Other tables on the same cluster are fine.
>>>>>>>>>>>
>>>>>>>>>>> Any idea what could go wrong? Is the table not healthy? When I run hbase hbck, it reports the cluster as healthy.
>>>>>>>>>>>
>>>>>>>>>>> BTW, we are running 0.94.7, and we need to take a snapshot of the data to export to a new cluster running 0.94.26 as part of an upgrade (and eventually an upgrade to 1.x).
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Tian-Ying
>>>>>>>>>>>
>>>>>>>>>>> 2015-05-19 06:00:45,505 ERROR org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler: Failed taking snapshot { ss=ss_rich_pin_data_v1 table=rich_pin_data_v1 type=SKIPFLUSH } due to exception:No region directory found for region:{NAME => 'rich_pin_data_v1,,1389319134976.dff681880bb2b23d0351d6656a1dbbb9.', STARTKEY => '', ENDKEY => '001ff3a165ff571471603035ca7b4be9', ENCODED => dff681880bb2b23d0351d6656a1dbbb9,}
>>>>>>>>>>> org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: No region directory found for region:{NAME => 'rich_pin_data_v1,,1389319134976.dff681880bb2b23d0351d6656a1dbbb9.', STARTKEY => '', ENDKEY => '001ff3a165ff571471603035ca7b4be9', ENCODED => dff681880bb2b23d0351d6656a1dbbb9,}
>>>>>>>>>>>         at org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifyRegion(MasterSnapshotVerifier.java:167)
>>>>>>>>>>>         at org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifyRegions(MasterSnapshotVerifier.java:152)
>>>>>>>>>>>         at org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifySnapshot(MasterSnapshotVerifier.java:115)
>>>>>>>>>>>         at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.process(TakeSnapshotHandler.java:156)
>>>>>>>>>>>         at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
>>>>>>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>>>>>>>         at java.lang.Thread.run(Thread.java:662)
>>>>>>>>>>> 2015-05-19 06:00:45,505 INFO org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler: Stop taking snapshot={ ss=ss_rich_pin_data_v1 table=rich_pin_data_v1 type=SKIPFLUSH } because: Failed to take snapshot '{ ss=ss_rich_pin_data_v1 table=rich_pin_data_v1 type=SKIPFLUSH }' due to exception
>>>>>>>>>>> 2015-05-19 06:00:49,745 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 50 on 60000 caught: java.lang.ArrayIndexOutOfBoundsException: 2
>>>>>>>>>>>         at java.util.Arrays$ArrayList.get(Arrays.java:3381)
>>>>>>>>>>>         at java.util.Collections$UnmodifiableList.get(Collections.java:1152)
>>>>>>>>>>>         at org.apache.hadoop.hbase.protobuf.generated.HBaseProtos$SnapshotDescription$Type.getValueDescriptor(HBaseProtos.java:99)
>>>>>>>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>>>>         at com.google.protobuf.GeneratedMessage.invokeOrDie(GeneratedMessage.java:1369)
>>>>>>>>>>>         at com.google.protobuf.GeneratedMessage.access$1400(GeneratedMessage.java:57)
>>>>>>>>>>>         at com.google.protobuf.GeneratedMessage$FieldAccessorTable$SingularEnumFieldAccessor.get(GeneratedMessage.java:1670)
>>>>>>>>>>>         at com.google.protobuf.GeneratedMessage.getField(GeneratedMessage.java:162)
>>>>>>>>>>>         at com.google.protobuf.GeneratedMessage.getAllFieldsMutable(GeneratedMessage.java:113)
>>>>>>>>>>>         at com.google.protobuf.GeneratedMessage.getAllFields(GeneratedMessage.java:152)
>>>>>>>>>>>         at com.google.protobuf.TextFormat$Printer.print(TextFormat.java:228)
>>>>>>>>>>>         at com.google.protobuf.TextFormat$Printer.access$200(TextFormat.java:217)
>>>>>>>>>>>         at com.google.protobuf.TextFormat.print(TextFormat.java:68)
>>>>>>>>>>>         at com.google.protobuf.TextFormat.printToString(TextFormat.java:115)
>>>>>>>>>>>         at com.google.protobuf.AbstractMessage.toString(AbstractMessage.java:86)
>>>>>>>>>>>         at org.apache.hadoop.hbase.snapshot.HSnapshotDescription.toString(HSnapshotDescription.java:72)
>>>>>>>>>>>         at java.lang.String.valueOf(String.java:2826)
>>>>>>>>>>>         at java.lang.StringBuilder.append(StringBuilder.java:115)
>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.Invocation.toString(Invocation.java:152)
>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.HBaseServer$Call.toString(HBaseServer.java:304)
>>>>>>>>>>>         at java.lang.String.valueOf(String.java:2826)
>>>>>>>>>>>         at java.lang.StringBuilder.append(StringBuilder.java:115)
