Here is a totally different error for an OpenTSDB cluster. Any idea why it could fail to post to the zk node?
2015-05-20 16:45:31,151 ERROR org.apache.hadoop.hbase.procedure.ProcedureMember: Failed to post zk node:/hbase/online-snapshot/reached/ss_tsdb/opentsdb-live-regionserver-46487291.ec2.pin220.com,60020,1430950243187 to join procedure barrier.
java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/online-snapshot/reached/ss_tsdb/opentsdb-live-regionserver-46487291.ec2.pin220.com,60020,1430950243187
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.sendMemberCompleted(ZKProcedureMemberRpcs.java:267)
        at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:185)
        at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/online-snapshot/reached/ss_tsdb/opentsdb-live-regionserver-46487291.ec2.pin220.com,60020,1430950243187
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:421)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:403)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:1110)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:1099)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:1083)
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.sendMemberCompleted(ZKProcedureMemberRpcs.java:265)
        ... 7 more
2015-05-20 16:45:31,152 ERROR org.apache.hadoop.hbase.procedure.Subprocedure: Failed to post zk node:/hbase/online-snapshot/reached/ss_tsdb/opentsdb-live-regionserver-46487291.ec2.pin220.com,60020,1430950243187 to join procedure barrier.
java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/online-snapshot/reached/ss_tsdb/opentsdb-live-regionserver-46487291.ec2.pin220.com,60020,1430950243187
        at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.sendMemberCompleted(ZKProcedureMemberRpcs.java:267)
        at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:185)
        at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:662)

Thanks
Tian-Ying
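The NoNode error means the reached/ barrier parent for the snapshot was already gone when this member tried to post its completion znode, which is consistent with the master having aborted and cleaned up the procedure first. The barrier znodes can be watched from outside while a snapshot runs; a minimal sketch with the plain ZooKeeper client (the quorum string and the default /hbase parent znode are assumptions to adjust for your deployment):

    import java.util.List;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class SnapshotBarrierCheck {
        public static void main(String[] args) throws Exception {
            // Assumed quorum; replace with your hbase.zookeeper.quorum hosts.
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000,
                new Watcher() {
                    public void process(WatchedEvent event) { /* no-op */ }
                });
            try {
                // The three phases of the online-snapshot procedure barrier.
                for (String node : new String[] {"/hbase/online-snapshot/acquired",
                                                 "/hbase/online-snapshot/reached",
                                                 "/hbase/online-snapshot/abort"}) {
                    Stat stat = zk.exists(node, false);
                    if (stat == null) {
                        System.out.println(node + " does not exist");
                        continue;
                    }
                    List<String> children = zk.getChildren(node, false);
                    System.out.println(node + " -> " + children);
                }
            } finally {
                zk.close();
            }
        }
    }

If the per-snapshot child under reached/ never appears, or appears and is deleted before all members post, that points at the abort path rather than at ZooKeeper itself.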
On Wed, May 20, 2015 at 9:26 AM, Tianying Chang <[email protected]> wrote:

> Just want to report some updates back.
>
> I fixed the problem by performing a rolling restart. From the DEBUG
> logging on the RS, I can see the DEBUG information pasted below; it is
> doing SKIP_FLUSH to take the snapshot. Before the rolling restart, none of
> these lines got printed when taking a snapshot of this table (other tables
> do print this information). One thing to mention is that I have to use
> SKIP_FLUSH; otherwise the snapshot will still fail, probably because of
> the particular traffic to this cluster. I am not sure it is just heavy
> writes, since we have other clusters that take an even higher write load
> and do not have this problem. It seems to me that the RSs are in some
> weird bad state where they won't take a snapshot; I am not sure where the
> bug is, though.
>
> 2015-05-20 16:11:34,230 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure: take snapshot without flush memstore first
> 2015-05-20 16:11:34,230 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Storing region-info for snapshot.
> 2015-05-20 16:11:34,255 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Creating references for hfiles
> 2015-05-20 16:11:34,255 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Adding snapshot references for [hdfs://ec2-54-243-40-6.compute-1.amazonaws.com/hbase/rich_pin_data_v1/c578f8a0b71e55a14c652ba16aa54b45/d/a60f3b6680ab48728bff61be3d1b419c, hdfs://ec2-54-243-40-6.compute-1.amazonaws.com/hbase/rich_pin_data_v1/c578f8a0b71e55a14c652ba16aa54b45/d/771b8125162f4866a002693152704280, hdfs://ec2-54-243-40-6.compute-1.amazonaws.com/hbase/rich_pin_data_v1/c578f8a0b71e55a14c652ba16aa54b45/d/633f76def221463991611b0b9d007bbd, hdfs://ec2-54-243-40-6.compute-1.amazonaws.com/hbase/rich_pin_data_v1/c578f8a0b71e55a14c652ba16aa54b45/d/82b91072d82c4f0a97960dd9bd65b814, hdfs://ec2-54-243-40-6.compute-1.amazonaws.com/hbase/rich_pin_data_v1/c578f8a0b71e55a14c652ba16aa54b45/d/76dac71663ac40919d227f039307cd6d] hfiles
>
> Thanks
> Tian-Ying
>
> On Tue, May 19, 2015 at 2:26 PM, Esteban Gutierrez <[email protected]> wrote:
>
>> http://pastebin.com or http://gist.github.com work fine.
>>
>> thanks,
>> esteban.
>>
>> --
>> Cloudera, Inc.
>>
>> On Tue, May 19, 2015 at 2:23 PM, Tianying Chang <[email protected]> wrote:
>>
>>> Sure, Esteban. Where is a good place to upload the log?
>>>
>>> On Tue, May 19, 2015 at 2:01 PM, Esteban Gutierrez <[email protected]> wrote:
>>>
>>>> The latest log is very interesting, Tianying, but I don't see how it is
>>>> related to the initial CorruptedSnapshotException, since
>>>> ZKProcedureMemberRpcs is aborting the operation due to a timeout. Would
>>>> it be possible for you to upload the HBase master and region server
>>>> logs to pastebin or some other site? Just around the time you started
>>>> the snapshot and it failed should be fine.
>>>>
>>>> thanks,
>>>> esteban.
>>>>
>>>> --
>>>> Cloudera, Inc.
>>>>
>>>> On Tue, May 19, 2015 at 1:45 PM, Tianying Chang <[email protected]> wrote:
>>>>
>>>>> Matteo,
>>>>>
>>>>> By looking at the DEBUG log on the RS side, it seems to me that no
>>>>> online regions were picked up. So it seems to me that the call below
>>>>> returns 0 regions. But I am not sure how that happens. Is there any
>>>>> way to verify this?
>>>>>
>>>>>     involvedRegions = getRegionsToSnapshot(snapshot);
>>>>>
>>>>> 2015-05-19 20:36:22,223 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager: cancelling 0 tasks for snapshot
>>>>> Thanks
>>>>> Tian-Ying
>>>>>
>>>>> Full log:
>>>>>
>>>>> 2015-05-19 20:35:46,684 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/online-snapshot/acquired
>>>>> 2015-05-19 20:35:46,686 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot ss_rich_pin_data_v1 from table rich_pin_data_v1
>>>>> 2015-05-19 20:36:21,723 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received created event:/hbase/online-snapshot/abort/ss_rich_pin_data_v1
>>>>> 2015-05-19 20:36:21,724 ERROR org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign exception to subprocedure ss_rich_pin_data_v1
>>>>> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@5add830c:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746683, End:1432067781681, diff:34998, max:35000 ms
>>>>>         at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
>>>>>         at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
>>>>>         at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:95)
>>>>>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:290)
>>>>>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
>>>>>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>>>> Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed!
>>>>> Source:Timeout caused Foreign Exception Start:1432067746683, End:1432067781681, diff:34998, max:35000 ms
>>>>>         at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:71)
>>>>>         at java.util.TimerThread.mainLoop(Timer.java:512)
>>>>>         at java.util.TimerThread.run(Timer.java:462)
>>>>> 2015-05-19 20:36:21,724 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/online-snapshot/abort
>>>>> 2015-05-19 20:36:21,726 ERROR org.apache.hadoop.hbase.procedure.ProcedureMember: Propagating foreign exception to subprocedure ss_rich_pin_data_v1
>>>>> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@5add830c:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746683, End:1432067781681, diff:34998, max:35000 ms
>>>>>         at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
>>>>>         at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
>>>>>         at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.watchForAbortedProcedures(ZKProcedureMemberRpcs.java:143)
>>>>>         at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$200(ZKProcedureMemberRpcs.java:56)
>>>>>         at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:111)
>>>>>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:311)
>>>>>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
>>>>>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>>>> Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746683, End:1432067781681, diff:34998, max:35000 ms
>>>>>         at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:71)
>>>>>         at java.util.TimerThread.mainLoop(Timer.java:512)
>>>>>         at java.util.TimerThread.run(Timer.java:462)
>>>>> 2015-05-19 20:36:21,780 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/online-snapshot/acquired
>>>>> 2015-05-19 20:36:21,784 INFO org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/online-snapshot/abort
>>>>> 2015-05-19 20:36:22,221 ERROR org.apache.hadoop.hbase.procedure.Subprocedure: Subprocedure 'ss_rich_pin_data_v1' aborting due to a ForeignException!
>>>>> org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@25ff6d2a:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms
>>>>>         at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
>>>>>         at org.apache.hadoop.hbase.procedure.Procedure.waitForLatch(Procedure.java:369)
>>>>>         at org.apache.hadoop.hbase.procedure.Subprocedure.waitForReachedGlobalBarrier(Subprocedure.java:296)
>>>>>         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:170)
>>>>>         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
>>>>>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>>>>>         at java.lang.Thread.run(Thread.java:662)
>>>>> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms
>>>>>         at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
>>>>>         at java.util.TimerThread.mainLoop(Timer.java:512)
>>>>>         at java.util.TimerThread.run(Timer.java:462)
>>>>> 2015-05-19 20:36:22,223 ERROR org.apache.hadoop.hbase.procedure.Subprocedure: Subprocedure 'ss_rich_pin_data_v1' aborting due to a ForeignException!
>>>>> org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@25ff6d2a:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed!
>>>>> Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms
>>>>>         at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
>>>>>         at org.apache.hadoop.hbase.procedure.Procedure.waitForLatch(Procedure.java:369)
>>>>>         at org.apache.hadoop.hbase.procedure.Subprocedure.waitForReachedGlobalBarrier(Subprocedure.java:296)
>>>>>         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:170)
>>>>>         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
>>>>>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>>>>>         at java.lang.Thread.run(Thread.java:662)
>>>>> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms
>>>>>         at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
>>>>>         at java.util.TimerThread.mainLoop(Timer.java:512)
>>>>>         at java.util.TimerThread.run(Timer.java:462)
>>>>> 2015-05-19 20:36:22,223 INFO org.apache.hadoop.hbase.regionserver.snapshot.FlushSnapshotSubprocedure: Aborting all online FLUSH snapshot subprocedure task threads for 'ss_rich_pin_data_v1' due to error
>>>>> org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@25ff6d2a:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms
>>>>>         at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
>>>>>         at org.apache.hadoop.hbase.procedure.Procedure.waitForLatch(Procedure.java:369)
>>>>>         at org.apache.hadoop.hbase.procedure.Subprocedure.waitForReachedGlobalBarrier(Subprocedure.java:296)
>>>>>         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:170)
>>>>>         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
>>>>>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>>>>>         at java.lang.Thread.run(Thread.java:662)
>>>>> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1432067746687, End:1432067781721, diff:35034, max:35000 ms
>>>>>         at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
>>>>>         at java.util.TimerThread.mainLoop(Timer.java:512)
>>>>>         at java.util.TimerThread.run(Timer.java:462)
>>>>> 2015-05-19 20:36:22,223 DEBUG org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager: cancelling 0 tasks for snapshot
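The repeated "Timeout elapsed! ... max:35000 ms" lines show the procedure barrier giving up after the configured ~35 s. If the RSs are merely slow (e.g. waiting out a long flush or reference creation) rather than wedged, raising the snapshot timeouts on the master and region servers can get the snapshot through. The relevant hbase-site.xml properties in the 0.94-era snapshot code are, to the best of my knowledge, the following; treat the exact names and defaults as assumptions to verify against your build:

    hbase.snapshot.master.timeoutMillis   (master-side wait for the snapshot, ms)
    hbase.snapshot.region.timeout         (RS-side subprocedure timeout, ms)

Both sides need to agree, since whichever barrier times out first aborts the whole procedure.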
>>>>> On Tue, May 19, 2015 at 1:30 PM, Tianying Chang <[email protected]> wrote:
>>>>>
>>>>>> Matteo,
>>>>>>
>>>>>> We are using HDFS 2.0 + HBase 0.94.7.
>>>>>>
>>>>>> I saw this ArrayIndexOutOfBoundsException: 2 error also. What does that mean?
>>>>>>
>>>>>> BTW, other tables in this same cluster (though those are smaller in terms of region count) are able to create snapshots; only this table is failing.
>>>>>>
>>>>>> Thanks
>>>>>> Tian-Ying
>>>>>>
>>>>>> On Tue, May 19, 2015 at 11:50 AM, Matteo Bertozzi <[email protected]> wrote:
>>>>>>
>>>>>>> Can you debug the protobuf problem? I think we abort because we are not able to write:
>>>>>>>
>>>>>>> 2015-05-19 06:00:49,745 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 50 on 60000 caught: java.lang.ArrayIndexOutOfBoundsException: 2
>>>>>>>         at java.util.Arrays$ArrayList.get(Arrays.java:3381)
>>>>>>>         at java.util.Collections$UnmodifiableList.get(Collections.java:1152)
>>>>>>>         at org.apache.hadoop.hbase.protobuf.generated.HBaseProtos$SnapshotDescription$Type.getValueDescriptor(HBaseProtos.java:99)
>>>>>>>         ...
>>>>>>>         at com.google.protobuf.AbstractMessage.toString(AbstractMessage.java:86)
>>>>>>>         at org.apache.hadoop.hbase.snapshot.HSnapshotDescription.toString(HSnapshotDescription.java:72)
>>>>>>>         at java.lang.String.valueOf(String.java:2826)
>>>>>>>         at java.lang.StringBuilder.append(StringBuilder.java:115)
>>>>>>>         at org.apache.hadoop.hbase.ipc.Invocation.toString(Invocation.java:152)
>>>>>>>         at org.apache.hadoop.hbase.ipc.HBaseServer$Call.toString(HBaseServer.java:304)
>>>>>>>
>>>>>>> Matteo
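As to what the AIOOBE likely means (an inference from the trace, not something confirmed in this thread): SnapshotDescription.Type carries DISABLED(0), FLUSH(1) and, in builds with the SKIP_FLUSH feature (added around HBASE-10935), SKIPFLUSH(2). Type.getValueDescriptor() indexes the enum's descriptor list by value, so if any side of the RPC still runs generated code that predates SKIPFLUSH, printing a description with type 2 walks off a two-element list, which is exactly the Arrays$ArrayList.get frame above. A self-contained sketch of the failing pattern (list contents hypothetical):

    import java.util.Arrays;
    import java.util.List;

    public class EnumDescriptorMismatch {
        public static void main(String[] args) {
            // Stand-in for an old generated SnapshotDescription$Type
            // descriptor list that predates SKIPFLUSH (hypothetical build).
            List<String> valueDescriptors = Arrays.asList("DISABLED", "FLUSH");
            int wireValue = 2; // SKIPFLUSH ordinal sent by a newer peer
            // Mirrors Type.getValueDescriptor(): index the descriptor list
            // by value; an unknown value walks off the end of the list.
            System.out.println(valueDescriptors.get(wireValue));
            // -> java.lang.ArrayIndexOutOfBoundsException: 2
            //        at java.util.Arrays$ArrayList.get(...)
        }
    }

If that reading is right, it would fit a mixed 0.94.7/0.94.26 environment where one side knows SKIPFLUSH and the other does not.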
>>>>>>> On Tue, May 19, 2015 at 11:35 AM, Tianying Chang <[email protected]> wrote:
>>>>>>>
>>>>>>>> Actually, I find it does not even print out the debug info below for this table; other tables do print this logging. So it seems it did not invoke the FlushSnapshotSubprocedure at all.
>>>>>>>>
>>>>>>>>     @Override
>>>>>>>>     public Void call() throws Exception {
>>>>>>>>       // Taking the region read lock prevents the individual region from being closed while a
>>>>>>>>       // snapshot is in progress. This is helpful but not sufficient for preventing races with
>>>>>>>>       // snapshots that involve multiple regions and regionservers. It is still possible to have
>>>>>>>>       // an interleaving such that globally regions are missing, so we still need the verification
>>>>>>>>       // step.
>>>>>>>>       LOG.debug("Starting region operation on " + region);
>>>>>>>>
>>>>>>>> On Tue, May 19, 2015 at 11:26 AM, Tianying Chang <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Esteban,
>>>>>>>>>
>>>>>>>>> There is no region split in this cluster, since we set the region size upper bound really high to prevent splitting.
>>>>>>>>>
>>>>>>>>> I think it happens for all the regions of this table.
>>>>>>>>>
>>>>>>>>> I repeatedly ran "hdfs dfs -lsr /hbase/.hbase-snapshot/ss_rich_pin_data_v1" while taking the snapshot; no region was able to write into this directory. I also turned on DEBUG logging on the RSs; they all just report failure with a timeout, with no specific reason.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Tian-Ying
>>>>>>>>>
>>>>>>>>> On Tue, May 19, 2015 at 11:06 AM, Esteban Gutierrez <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Tianying,
>>>>>>>>>>
>>>>>>>>>> Is this happening consistently in this region, or is it happening randomly across other regions too? One possibility is that there was a split going on at the time you started to take the snapshot and it failed. If you look into /hbase/rich_pin_data_v1, can you find a directory named dff681880bb2b23d0351d6656a1dbbb9 in there?
>>>>>>>>>>
>>>>>>>>>> cheers,
>>>>>>>>>> esteban.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Cloudera, Inc.
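That directory check can also be scripted; a minimal FileSystem client that lists the region dir the verifier complains about (path assumes the default /hbase root dir, and the Hadoop config is picked up from the classpath):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RegionDirCheck {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml/hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Region directory named in the CorruptedSnapshotException.
            Path regionDir = new Path(
                "/hbase/rich_pin_data_v1/dff681880bb2b23d0351d6656a1dbbb9");
            if (!fs.exists(regionDir)) {
                System.out.println("Missing: " + regionDir);
                return;
            }
            for (FileStatus status : fs.listStatus(regionDir)) {
                System.out.println(status.getPath());
            }
        }
    }

If the directory exists on HDFS but the verifier still reports it missing, the mismatch is in what the snapshot captured, not in the table layout.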
>>>>>>>>>> On Mon, May 18, 2015 at 11:12 PM, Tianying Chang <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> We have a cluster that used to be able to take snapshots, but recently one table started failing with the error below. Other tables on the same cluster are fine.
>>>>>>>>>>>
>>>>>>>>>>> Any idea what could go wrong? Is the table not healthy? When I run hbase hbck, it reports the cluster as healthy.
>>>>>>>>>>>
>>>>>>>>>>> BTW, we are running 0.94.7, and we need to take a snapshot of the data to export to a new cluster running 0.94.26 as part of an upgrade (and eventually an upgrade to 1.x).
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Tian-Ying
>>>>>>>>>>>
>>>>>>>>>>> 2015-05-19 06:00:45,505 ERROR org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler: Failed taking snapshot { ss=ss_rich_pin_data_v1 table=rich_pin_data_v1 type=SKIPFLUSH } due to exception:No region directory found for region:{NAME => 'rich_pin_data_v1,,1389319134976.dff681880bb2b23d0351d6656a1dbbb9.', STARTKEY => '', ENDKEY => '001ff3a165ff571471603035ca7b4be9', ENCODED => dff681880bb2b23d0351d6656a1dbbb9,}
>>>>>>>>>>> org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: No region directory found for region:{NAME => 'rich_pin_data_v1,,1389319134976.dff681880bb2b23d0351d6656a1dbbb9.', STARTKEY => '', ENDKEY => '001ff3a165ff571471603035ca7b4be9', ENCODED => dff681880bb2b23d0351d6656a1dbbb9,}
>>>>>>>>>>>         at org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifyRegion(MasterSnapshotVerifier.java:167)
>>>>>>>>>>>         at org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifyRegions(MasterSnapshotVerifier.java:152)
>>>>>>>>>>>         at org.apache.hadoop.hbase.master.snapshot.MasterSnapshotVerifier.verifySnapshot(MasterSnapshotVerifier.java:115)
>>>>>>>>>>>         at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.process(TakeSnapshotHandler.java:156)
>>>>>>>>>>>         at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
>>>>>>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>>>>>>>         at java.lang.Thread.run(Thread.java:662)
>>>>>>>>>>> 2015-05-19 06:00:45,505 INFO org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler: Stop taking snapshot={ ss=ss_rich_pin_data_v1 table=rich_pin_data_v1 type=SKIPFLUSH } because: Failed to take snapshot '{ ss=ss_rich_pin_data_v1 table=rich_pin_data_v1 type=SKIPFLUSH }' due to exception
>>>>>>>>>>> 2015-05-19 06:00:49,745 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 50 on 60000 caught: java.lang.ArrayIndexOutOfBoundsException: 2
>>>>>>>>>>>         at java.util.Arrays$ArrayList.get(Arrays.java:3381)
>>>>>>>>>>>         at java.util.Collections$UnmodifiableList.get(Collections.java:1152)
>>>>>>>>>>>         at org.apache.hadoop.hbase.protobuf.generated.HBaseProtos$SnapshotDescription$Type.getValueDescriptor(HBaseProtos.java:99)
>>>>>>>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>>>>         at com.google.protobuf.GeneratedMessage.invokeOrDie(GeneratedMessage.java:1369)
>>>>>>>>>>>         at com.google.protobuf.GeneratedMessage.access$1400(GeneratedMessage.java:57)
>>>>>>>>>>>         at com.google.protobuf.GeneratedMessage$FieldAccessorTable$SingularEnumFieldAccessor.get(GeneratedMessage.java:1670)
>>>>>>>>>>>         at com.google.protobuf.GeneratedMessage.getField(GeneratedMessage.java:162)
>>>>>>>>>>>         at com.google.protobuf.GeneratedMessage.getAllFieldsMutable(GeneratedMessage.java:113)
>>>>>>>>>>>         at com.google.protobuf.GeneratedMessage.getAllFields(GeneratedMessage.java:152)
>>>>>>>>>>>         at com.google.protobuf.TextFormat$Printer.print(TextFormat.java:228)
>>>>>>>>>>>         at com.google.protobuf.TextFormat$Printer.access$200(TextFormat.java:217)
>>>>>>>>>>>         at com.google.protobuf.TextFormat.print(TextFormat.java:68)
>>>>>>>>>>>         at com.google.protobuf.TextFormat.printToString(TextFormat.java:115)
>>>>>>>>>>>         at com.google.protobuf.AbstractMessage.toString(AbstractMessage.java:86)
>>>>>>>>>>>         at org.apache.hadoop.hbase.snapshot.HSnapshotDescription.toString(HSnapshotDescription.java:72)
>>>>>>>>>>>         at java.lang.String.valueOf(String.java:2826)
>>>>>>>>>>>         at java.lang.StringBuilder.append(StringBuilder.java:115)
>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.Invocation.toString(Invocation.java:152)
>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.HBaseServer$Call.toString(HBaseServer.java:304)
>>>>>>>>>>>         at java.lang.String.valueOf(String.java:2826)
>>>>>>>>>>>         at java.lang.StringBuilder.append(StringBuilder.java:115)
