Subprocedure pool is full. How many snapshot requests did you submit before seeing the following ?
Cheers On Sep 23, 2014, at 2:28 AM, Ian Brooks <[email protected]> wrote: > Hi, > > I'm seeing an issue on our hbase cluster which is preventing snapshots from > working. So far the only way i can get it working again is to restart all the > regionservers which is not ideal. > > I'm running hbase 0.98.3 on hadoop 2.4.0, the stacktraces are > > From host given the snapshot command via hbase shell > > 2014-09-23 10:00:16,116 INFO [regionserver16020-EventThread] > procedure.ZKProcedureMemberRpcs: Received procedure start children changed > event: /hbase/online-snapshot/acquired > 2014-09-23 10:00:16,116 DEBUG [regionserver16020-EventThread] > procedure.ZKProcedureMemberRpcs: Looking for new procedures under > znode:'/hbase/online-snapshot/acquired' > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread] > procedure.ZKProcedureMemberRpcs: Found procedure znode: > /hbase/online-snapshot/acquired/offlineScans-20140923-100009 > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread] > procedure.ZKProcedureMemberRpcs: start proc data length is 59 > 2014-09-23 10:00:16,117 DEBUG [regionserver16020-EventThread] > procedure.ZKProcedureMemberRpcs: Found data for > znode:/hbase/online-snapshot/acquired/offlineScans-20140923-100009 > 2014-09-23 10:00:16,118 DEBUG [regionserver16020-EventThread] > snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot > offlineScans-20140923-100009 from table offlineScans > 2014-09-23 10:00:16,118 DEBUG [regionserver16020-EventThread] > procedure.ProcedureMember: Submitting new > Subprocedure:offlineScans-20140923-100009 > 2014-09-23 10:00:16,120 DEBUG [member: '############,16020,1410126483043' > subprocedure-pool1-thread-314] procedure.Subprocedure: Starting subprocedure > 'offlineScans-20140923-100009' with timeout 60000ms > 2014-09-23 10:00:16,120 DEBUG [member: '############,16020,1410126483043' > subprocedure-pool1-thread-314] errorhandling.TimeoutExceptionInjector: > Scheduling process timer to run in: 60000 ms > 2014-09-23 10:00:16,121 DEBUG [member: '############,16020,1410126483043' > subprocedure-pool1-thread-314] procedure.Subprocedure: Subprocedure > 'offlineScans-20140923-100009' starting 'acquire' stage > 2014-09-23 10:00:16,121 DEBUG [member: '############,16020,1410126483043' > subprocedure-pool1-thread-314] procedure.Subprocedure: Subprocedure > 'offlineScans-20140923-100009' locally acquired > 2014-09-23 10:00:16,121 DEBUG [member: '############,16020,1410126483043' > subprocedure-pool1-thread-314] procedure.ZKProcedureMemberRpcs: Member: > '##########,16020,1410126483043' joining acquired barrier for procedure > (offlineScans-20140923-100009) in zk > 2014-09-23 10:00:16,123 DEBUG [member: '############,16020,1410126483043' > subprocedure-pool1-thread-314] procedure.ZKProcedureMemberRpcs: Watch for > global barrier > reached:/hbase/online-snapshot/reached/offlineScans-20140923-100009 > 2014-09-23 10:00:16,124 DEBUG [member: '############,16020,1410126483043' > subprocedure-pool1-thread-314] procedure.Subprocedure: Subprocedure > 'offlineScans-20140923-100009' coordinator notified of 'acquire', waiting on > 'reached' or 'abort' from coordinator > 2014-09-23 10:00:16,133 INFO [regionserver16020-EventThread] > procedure.ZKProcedureMemberRpcs: Received created > event:/hbase/online-snapshot/abort/offlineScans-20140923-100009 > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] > procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode > /hbase/online-snapshot/abort/offlineScans-20140923-100009 > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] > procedure.ProcedureMember: Request received to abort procedure > offlineScans-20140923-100009 > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via > ###############,16020,1408633802567:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: > > at > org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171) > at > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320) > at > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94) > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: > at > org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:270) > at > org.apache.hadoop.hbase.procedure.ProcedureMember.submitSubprocedure(ProcedureMember.java:171) > at > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.startNewSubprocedure(ZKProcedureMemberRpcs.java:214) > at > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.waitForNewProcedures(ZKProcedureMemberRpcs.java:172) > at > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$100(ZKProcedureMemberRpcs.java:55) > at > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:107) > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348) > ... 2 more > > > > From the regionserver it was trying to contact > > > 2014-09-23 10:00:16,124 INFO [regionserver16020-EventThread] > procedure.ZKProcedureMemberRpcs: Received procedure start children changed > event: /hbase/online-snapshot/acquired > 2014-09-23 10:00:16,124 DEBUG [regionserver16020-EventThread] > procedure.ZKProcedureMemberRpcs: Looking for new procedures under > znode:'/hbase/online-snapshot/acquired' > 2014-09-23 10:00:16,125 DEBUG [regionserver16020-EventThread] > procedure.ZKProcedureMemberRpcs: Found procedure znode: > /hbase/online-snapshot/acquired/offlineScans-20140923-100009 > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] > procedure.ZKProcedureMemberRpcs: start proc data length is 59 > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] > procedure.ZKProcedureMemberRpcs: Found data for > znode:/hbase/online-snapshot/acquired/offlineScans-20140923-100009 > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] > snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot > offlineScans-20140923-100009 from table offlineScans > 2014-09-23 10:00:16,127 DEBUG [regionserver16020-EventThread] > procedure.ProcedureMember: Submitting new > Subprocedure:offlineScans-20140923-100009 > 2014-09-23 10:00:16,128 ERROR [regionserver16020-EventThread] > procedure.Subprocedure: Subprocedure pool is full! > 2014-09-23 10:00:16,128 DEBUG [regionserver16020-EventThread] > errorhandling.ForeignExceptionDispatcher: offlineScans-20140923-100009 > accepting received exception > 2014-09-23 10:00:16,128 DEBUG [regionserver16020-EventThread] > procedure.ZKProcedureMemberRpcs: Aborting procedure > (offlineScans-20140923-100009) in zk > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] > procedure.ZKProcedureMemberRpcs: Finished creating abort > znode:/hbase/online-snapshot/abort/offlineScans-20140923-100009 > 2014-09-23 10:00:16,134 ERROR [regionserver16020-EventThread] > procedure.ProcedureMember: Failed to start subprocedure > 'offlineScans-20140923-100009' > 2014-09-23 10:00:16,134 INFO [regionserver16020-EventThread] > procedure.ZKProcedureMemberRpcs: Received created > event:/hbase/online-snapshot/abort/offlineScans-20140923-100009 > 2014-09-23 10:00:16,134 DEBUG [regionserver16020-EventThread] > procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode > /hbase/online-snapshot/abort/offlineScans-20140923-100009 > 2014-09-23 10:00:16,145 DEBUG [regionserver16020-EventThread] > procedure.ProcedureMember: Request received to abort procedure > offlineScans-20140923-100009 > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via > ###########,16020,1408633802567:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: > at > org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171) > at > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320) > at > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94) > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: > org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: > at > org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:270) > at > org.apache.hadoop.hbase.procedure.ProcedureMember.submitSubprocedure(ProcedureMember.java:171) > at > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.startNewSubprocedure(ZKProcedureMemberRpcs.java:214) > at > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.waitForNewProcedures(ZKProcedureMemberRpcs.java:172) > at > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.access$100(ZKProcedureMemberRpcs.java:55) > at > org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeChildrenChanged(ZKProcedureMemberRpcs.java:107) > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:348) > ... 2 more > > Any ideas on how I can resolve this? > > -Ian >
