Thanks for the quick responses. I’ll get back on this later; I discovered that HBase didn’t restart properly after changing the timeouts, so the second ERROR may be a side-effect of that.
I also just discovered that the table in question was not pre-split properly, and the region distribution is screwed up. So I’ll clean up the mess and try again tomorrow.

Regrets for the possible false alarm.

Brian

On Oct 8, 2014, at 3:25 PM, Brian Jeltema <[email protected]> wrote:

> Sorry, I usually include that info. HBase version is 0.98. hbase.rpc.timeout
> is the default.
>
> When the ‘ERROR: Call id….’ occurred, there was no stack trace. That was the
> entire error output.
>
> Before I increased the snapshot timeout parameters, the timeout I was seeing
> looked like:
>
> ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot {
> ss=Host-bdj table=Host type=FLUSH } had an error. Procedure Host-bdj {
> waiting=[] done=[host-22.hdfs.foo.net,60020,1410543068459,
> host-24.hdfs.foo.net,60020,1412603246174,
> host-17.hdfs.foo.net,60020,1410543059186,
> host-19.hdfs.foo.net,60020,1412419924491,
> host-20.hdfs.foo.net,60020,1412419942143,
> host-16.hdfs.foo.net,60020,1403178964733,
> host-15.hdfs.foo.net,60020,1403178962029,
> host-21.hdfs.foo.net,60020,1403178959748,
> host-23.hdfs.foo.net,60020,1410543079248,
> host-18.hdfs.foo.net,60020,1410543061865] }
>     at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:366)
>     at org.apache.hadoop.hbase.master.HMaster.isSnapshotDone(HMaster.java:2993)
>     at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:38245)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
>     at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:73)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via timer-java.util.Timer@3097c4e1:org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1412792382137, End:1412792442137, diff:60000, max:60000 ms
>     at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:83)
>     at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:318)
>     at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:356)
>     ... 10 more
> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1412792382137, End:1412792442137, diff:60000, max:60000 ms
>     at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:67)
>     at java.util.TimerThread.mainLoop(Timer.java:555)
>     at java.util.TimerThread.run(Timer.java:505)
>
> On Oct 8, 2014, at 3:18 PM, Ted Yu <[email protected]> wrote:
>
>> Can you give a bit more information :
>>
>> the release of hbase you're using
>> value for hbase.rpc.timeout (looks like you leave it @ default)
>> more of the error (please include stack trace if possible)
>>
>> Cheers
>>
>> On Wed, Oct 8, 2014 at 12:09 PM, Brian Jeltema <[email protected]> wrote:
>>
>>> I’m trying to snapshot a moderately large table (3 billion rows, but not a
>>> huge amount of data per row).
>>> Those snapshots have been timing out, so I set the following parameters to
>>> relatively large values:
>>>
>>> hbase.snapshot.master.timeoutMillis
>>> hbase.snapshot.region.timeout
>>> hbase.snapshot.master.timeout.millis
>>>
>>> A snapshot attempt then resulted in the terse result:
>>>
>>> ERROR: Call id=13, waitTime=60060, rpcTimeout=60000
>>>
>>> A brief review of some of the hbase log files didn’t reveal anything (but
>>> there are many).
>>> How should I pursue getting these snapshots to work?
>>>
>>> Brian
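
A note on the two errors quoted in this thread: each reports a 60000 ms limit, but they come from different places. The first is the snapshot procedure's own timer (max:60000 ms in the TimeoutException). The second is a client-side RPC wait (rpcTimeout=60000). The small Python sketch below only parses the two messages exactly as they appear above and checks the arithmetic. The function names are made up for illustration. The idea that the second limit corresponds to hbase.rpc.timeout is suggested by Ted's question and is an assumption, not something the thread confirms.

```python
import re

# Both strings are copied from the messages quoted in this thread.
SNAPSHOT_TIMEOUT = (
    "Timeout elapsed! Source:Timeout caused Foreign Exception "
    "Start:1412792382137, End:1412792442137, diff:60000, max:60000 ms"
)
RPC_ERROR = "ERROR: Call id=13, waitTime=60060, rpcTimeout=60000"


def parse_snapshot_timeout(message):
    """Extract the millisecond fields from the snapshot TimeoutException text."""
    match = re.search(r"Start:(\d+), End:(\d+), diff:(\d+), max:(\d+) ms", message)
    if match is None:
        raise ValueError("not a snapshot timeout message")
    start, end, diff, limit = (int(value) for value in match.groups())
    return {"start": start, "end": end, "diff": diff, "max": limit}


def parse_rpc_error(message):
    """Extract call id, waitTime and rpcTimeout from the terse shell error."""
    match = re.search(r"Call id=(\d+), waitTime=(\d+), rpcTimeout=(\d+)", message)
    if match is None:
        raise ValueError("not an RPC timeout message")
    call_id, wait_ms, timeout_ms = (int(value) for value in match.groups())
    return {"call_id": call_id, "wait_ms": wait_ms, "timeout_ms": timeout_ms}


snapshot = parse_snapshot_timeout(SNAPSHOT_TIMEOUT)
rpc = parse_rpc_error(RPC_ERROR)

# The snapshot timer fired exactly at its configured maximum.
elapsed_ms = snapshot["end"] - snapshot["start"]
print("snapshot elapsed:", elapsed_ms, "ms; limit:", snapshot["max"], "ms")

# The RPC wait overshot its own limit slightly, independent of snapshot settings.
overshoot_ms = rpc["wait_ms"] - rpc["timeout_ms"]
print("rpc wait exceeded its limit by", overshoot_ms, "ms")
```

If that reading is right, raising only the snapshot timeouts would not change the second error, since its limit is reported separately as rpcTimeout.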
