Hi,
I applied the suggested timeout changes and it lasted longer before it started
failing, but its now just throwing the same errors on "procedure.Subprocedure:
Subprocedure pool is full!"
Looking through the logs, it looks like the nodes may not correctly updating
they completion state after finishing a snapshot, If i go through and restart
the offending region servers it will work again for a while, then stop again.
As far as I can tell, no snapshot is actually taking anywhere near the set
value of 10 minutes to run.
eg.
2014-09-25 21:03:15,713 INFO [regionserver16020-EventThread]
procedure.ZKProcedureMemberRpcs: Received procedure start children changed
event: /hbase/online-snapshot/acquired
2014-09-25 21:03:15,713 DEBUG [regionserver16020-EventThread]
procedure.ZKProcedureMemberRpcs: Looking for new procedures under
znode:'/hbase/online-snapshot/acquired'
2014-09-25 21:03:15,715 DEBUG [regionserver16020-EventThread]
procedure.ZKProcedureMemberRpcs: Found procedure znode:
/hbase/online-snapshot/acquired/requestData-20140925-210256
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread]
procedure.ZKProcedureMemberRpcs: start proc data length is 57
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread]
procedure.ZKProcedureMemberRpcs: Found data for
znode:/hbase/online-snapshot/acquired/requestData-20140925-210256
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread]
snapshot.RegionServerSnapshotManager: Launching subprocedure for snapshot
requestData-20140925-210256 from table requestData
2014-09-25 21:03:15,718 DEBUG [regionserver16020-EventThread]
procedure.ProcedureMember: Submitting new
Subprocedure:requestData-20140925-210256
2014-09-25 21:03:15,719 DEBUG [member: '###########-005,16020,1411640680181'
subprocedure-pool1-thread-1] procedure.Subprocedure: Starting subprocedure
'requestData-20140925-210256' with timeout 600000ms
2014-09-25 21:03:15,719 DEBUG [member: '###########-005,16020,1411640680181'
subprocedure-pool1-thread-1] errorhandling.TimeoutExceptionInjector: Scheduling
process timer to run in: 600000 ms
2014-09-25 21:03:15,720 DEBUG [member: '###########-005,16020,1411640680181'
subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure
'requestData-20140925-210256' starting 'acquire' stage
2014-09-25 21:03:15,720 DEBUG [member: '###########-005,16020,1411640680181'
subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure
'requestData-20140925-210256' locally acquired
2014-09-25 21:03:15,720 DEBUG [member: '###########-005,16020,1411640680181'
subprocedure-pool1-thread-1] procedure.ZKProcedureMemberRpcs: Member:
'###########-005,16020,1411640680181' joining acquired barrier fo
r procedure (requestData-20140925-210256) in zk
2014-09-25 21:03:15,724 DEBUG [member: '###########-005,16020,1411640680181'
subprocedure-pool1-thread-1] procedure.ZKProcedureMemberRpcs: Watch for global
barrier reached:/hbase/online-snapshot/reached/request
Data-20140925-210256
2014-09-25 21:03:15,739 DEBUG [member: '###########-005,16020,1411640680181'
subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure
'requestData-20140925-210256' coordinator notified of 'acquire', wa
iting on 'reached' or 'abort' from coordinator
2014-09-25 21:03:15,825 INFO [regionserver16020-EventThread]
procedure.ZKProcedureMemberRpcs: Received created
event:/hbase/online-snapshot/reached/requestData-20140925-210256
2014-09-25 21:03:15,825 DEBUG [regionserver16020-EventThread]
procedure.ZKProcedureMemberRpcs: Recieved reached global
barrier:/hbase/online-snapshot/reached/requestData-20140925-210256
2014-09-25 21:03:15,825 DEBUG [member: '###########-005,16020,1411640680181'
subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure
'requestData-20140925-210256' received 'reached' from coordinator.
...
2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181'
subprocedure-pool1-thread-1] snapshot.RegionServerSnapshotManager: Completed 5
local region snapshots.
2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181'
subprocedure-pool1-thread-1] snapshot.RegionServerSnapshotManager: cancelling 0
tasks for snapshot ###########-005,16020,1411640680181
2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181'
subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure
'requestData-20140925-210256' locally completed
2014-09-25 21:03:38,026 DEBUG [member: '###########-005,16020,1411640680181'
subprocedure-pool1-thread-1] procedure.ZKProcedureMemberRpcs: Marking procedure
'requestData-20140925-210256' completed for member '
###########-005,16020,1411640680181' in zk
2014-09-25 21:03:38,028 DEBUG [member: '###########-005,16020,1411640680181'
subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure
'requestData-20140925-210256' has notified controller of completion
2014-09-25 21:03:38,028 DEBUG [member: '###########-005,16020,1411640680181'
subprocedure-pool1-thread-1] errorhandling.TimeoutExceptionInjector: Marking
timer as complete - no error notifications will be recei
ved for this timer.
2014-09-25 21:03:38,028 DEBUG [member: '###########-005,16020,1411640680181'
subprocedure-pool1-thread-1] procedure.Subprocedure: Subprocedure
'requestData-20140925-210256' completed.
2
Which suggests that "requestData-20140925-210256" completed happily, however,
10 minutes later the following comes up in the logs
2014-09-25 21:13:15,777 INFO [regionserver16020-EventThread]
procedure.ZKProcedureMemberRpcs: Received created
event:/hbase/online-snapshot/abort/requestData-20140925-210256
2014-09-25 21:13:15,777 DEBUG [regionserver16020-EventThread]
procedure.ZKProcedureMemberRpcs: Aborting procedure member for znode
/hbase/online-snapshot/abort/requestData-20140925-210256
2014-09-25 21:13:15,778 DEBUG [regionserver16020-EventThread]
procedure.ProcedureMember: Request received to abort procedure
requestData-20140925-210256
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via
timer-java.util.Timer@5cb99ea6:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed!
Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727,
diff:600000, max:600000 ms
at
org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
at
org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
at
org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by:
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed!
Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727,
diff:600000, max:600000 ms
at
org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:70)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
2014-09-25 21:13:15,780 INFO [regionserver16020-EventThread]
procedure.ProcedureMember: Received abort on procedure with no local
subprocedure requestData-20140925-210256, ignoring it.
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via
timer-java.util.Timer@5cb99ea6:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed!
Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727,
diff:600000, max:600000 ms
at
org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
at
org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
at
org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs$1.nodeCreated(ZKProcedureMemberRpcs.java:94)
at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:327)
at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by:
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed!
Source:Timeout caused Foreign Exception Start:1411675395727, End:1411675995727,
diff:600000, max:600000 ms
at
org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:70)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
-Ian Brooks
On Tuesday 23 September 2014 06:52:34 Ted Yu wrote:
> Here're the config parameters related to controlling snapshot timeout:
>
> <property>
> <name>hbase.snapshot.master.timeoutMillis</name>
> <!-- Change from default of 60s to 600s to allow for slow flushing of
> tables -->
> <value>600000</value>
> <description>
> This is the time HBase master waits for the snapshot operation to
> complete.
> Do not confuse this hbase.snapshot.master.timeout.millis, which
> although
> sounding similar, serves a very different purpose.
> Note: This property has a completely different meaning before hbase
> version
> 0.94.11 and should not enabled on a cluster using snapshots and
> running
> a version before 0.94.11.
> </description>
> </property>
> <property>
> <name>hbase.snapshot.master.timeout.millis</name>
> <!-- Change from default of 60s to 600s to allow for slow flushing of
> tables -->
> <value>600000</value>
> <description>
> This is the timeout the master indicates the client to wait when it
> takes
> the snapshot. The client actually waits longer than this due to
> exponential
> backoff. See HBaseAdmin.snapshot for the exact algorithm.
> </description>
> </property>
> <property>
> <name>hbase.snapshot.region.timeout</name>
> <!-- Change from default of 60s to 600s to allow for slow flushing of
> tables -->
> <value>600000</value>
> <description>
> This is the time the regionserver waits to complete all of its
> activities
> for a snapshot operation.
> </description>
> </property>