Writes stay in the memstore until the memstore limit is hit or a flush is otherwise invoked (the periodic flusher, an explicit request, etc.), so it makes sense that there is still a lot left to flush when you take the snapshot.
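If the snapshot's own flush keeps timing out, it can help to flush the table yourself first, so that by the time the snapshot runs there is little left in the memstores. A minimal sketch from the HBase shell, using the table and snapshot names from your messages below:

    # Flush all regions of the table, then snapshot it; if the flush alone
    # takes minutes on an idle table, the bottleneck is below HBase.
    hbase shell <<'EOF'
    flush 'weaver_events'
    snapshot 'weaver_events', 'backup_weaver'
    EOF

Note that the flush command returns once the request is submitted, not when the flush finishes, so give the region servers a moment (or watch the memstore size metrics) before starting the snapshot.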
To me this points to a disk write throughput issue, which makes sense since you only have one disk per node. During the normal issues you were having, are you sure you didn't see anything about blocking writes?

On Thursday, November 6, 2014, Pere Kyle <p...@whisper.sh> wrote:

> So I have another symptom which is quite odd. When trying to take a
> snapshot of the table with no writes coming in (I stopped Thrift), it
> continually times out when trying to flush (I don't believe I have the
> option of a non-flush snapshot in 0.94). Every single time I get:
>
> ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException:
> org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot {
> ss=backup_weaver table=weaver_events type=FLUSH } had an error. Procedure
> backup_weaver {
> waiting=[ip-10-227-42-142.us-west-2.compute.internal,60020,1415302661297,
> ip-10-227-42-252.us-west-2.compute.internal,60020,1415304752318,
> ip-10-231-21-106.us-west-2.compute.internal,60020,1415306503049,
> ip-10-230-130-102.us-west-2.compute.internal,60020,1415296951057,
> ip-10-231-138-119.us-west-2.compute.internal,60020,1415303920176,
> ip-10-224-53-183.us-west-2.compute.internal,60020,1415311138483,
> ip-10-250-1-140.us-west-2.compute.internal,60020,1415311984665,
> ip-10-227-40-150.us-west-2.compute.internal,60020,1415313275623,
> ip-10-231-139-198.us-west-2.compute.internal,60020,1415295324957,
> ip-10-250-77-76.us-west-2.compute.internal,60020,1415297345932,
> ip-10-248-42-35.us-west-2.compute.internal,60020,1415312717768,
> ip-10-227-45-74.us-west-2.compute.internal,60020,1415296135484,
> ip-10-227-43-49.us-west-2.compute.internal,60020,1415303176867,
> ip-10-230-130-121.us-west-2.compute.internal,60020,1415294726028,
> ip-10-224-49-168.us-west-2.compute.internal,60020,1415312488614,
> ip-10-227-0-82.us-west-2.compute.internal,60020,1415301974178,
> ip-10-224-0-167.us-west-2.compute.internal,60020,1415309549108] done=[] }
>   at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:362)
>   at org.apache.hadoop.hbase.master.HMaster.isSnapshotDone(HMaster.java:2313)
>   at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:354)
>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1434)
> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via
> timer-java.util.Timer@239e8159:org.apache.hadoop.hbase.errorhandling.TimeoutException:
> Timeout elapsed! Source:Timeout caused Foreign Exception
> Start:1415319201016, End:1415319261016, diff:60000, max:60000 ms
>   at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
>   at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:285)
>   at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:352)
>   ... 6 more
> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout
> elapsed! Source:Timeout caused Foreign Exception Start:1415319201016,
> End:1415319261016, diff:60000, max:60000 ms
>   at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
>   at java.util.TimerThread.mainLoop(Timer.java:555)
>   at java.util.TimerThread.run(Timer.java:505)
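The diff:60000, max:60000 ms in that trace is the master-side snapshot timeout expiring at its 60-second default: the flush simply isn't finishing within a minute. You can give a slow flush more room with something like the following in hbase-site.xml; these property names are from the 0.94-era snapshot code, so treat this as a sketch and verify them against the hbase-default.xml shipped with your release:

    <!-- Allow snapshot flushes more than the default 60 s. Verify these
         property names against your version's hbase-default.xml. -->
    <property>
      <name>hbase.snapshot.master.timeout.millis</name>
      <value>300000</value>
    </property>
    <property>
      <name>hbase.snapshot.region.timeout</name>
      <value>300000</value>
    </property>

That only treats the symptom, though: if flushing an idle table takes more than a minute, the disk is the real problem.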
>
> I do not have a single write coming in, so how in the world could these
> tables not be flushed? I could understand an error the first time or two,
> but how could they still not be flushed after several requests? Now I can't
> even get the data off the node to a new cluster. Any help would be greatly
> appreciated.
>
> Thanks,
> -Pere
>
> On Nov 6, 2014, at 2:09 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>
> > One other thought: you might try tracing your requests to see where the
> > slowness happens. Recent versions of PerformanceEvaluation support this
> > feature and can be used directly or as an example for adding tracing to
> > your application.
> >
> > On Thursday, November 6, 2014, Pere Kyle <p...@whisper.sh> wrote:
> >
> >> Bryan,
> >>
> >> Thanks again for the incredibly useful reply.
> >>
> >> I have confirmed that callQueueLen is in fact 0, with a max value of 2
> >> in the last week (in Ganglia).
> >>
> >> hbase.hstore.compaction.max was set to 15 on the nodes, up from the
> >> previous 7.
> >>
> >> Freezes (laggy responses) on the cluster are frequent and affect both
> >> reads and writes. I noticed iowait spikes on the nodes.
> >>
> >> The cluster goes from working 100% to serving nothing/timeouts for no
> >> discernible reason.
> >>
> >> Looking through the logs I have tons of responseTooSlow warnings; this
> >> is the only regular occurrence in the logs:
> >>
> >> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06
> >> 03:54:31,640 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler
> >> 39 on 60020): (responseTooSlow):
> >> {"processingtimems":14573,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@c67b2ac),
> >> rpc version=1, client version=29, methodsFingerPrint=-540141542",
> >> "client":"10.231.139.198:57223","starttimems":1415246057066,
> >> "queuetimems":20640,"class":"HRegionServer","responsesize":0,"method":"multi"}
> >>
> >> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06
> >> 03:54:31,640 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler
> >> 42 on 60020): (responseTooSlow):
> >> {"processingtimems":45660,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@6c034090),
> >> rpc version=1, client version=29, methodsFingerPrint=-540141542",
> >> "client":"10.231.21.106:41126","starttimems":1415246025979,
> >> "queuetimems":202,"class":"HRegionServer","responsesize":0,"method":"multi"}
> >>
> >> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06
> >> 03:54:31,642 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler
> >> 46 on 60020): (responseTooSlow):
> >> {"processingtimems":14620,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@4fc3bb1f),
> >> rpc version=1, client version=29, methodsFingerPrint=-540141542",
> >> "client":"10.230.130.102:54068","starttimems":1415246057021,
> >> "queuetimems":27565,"class":"HRegionServer","responsesize":0,"method":"multi"}
> >>
> >> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06
> >> 03:54:31,642 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler
> >> 35 on 60020): (responseTooSlow):
> >> {"processingtimems":13431,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@3b321922),
> >> rpc version=1, client version=29, methodsFingerPrint=-540141542",
> >> "client":"10.227.42.252:60493","starttimems":1415246058210,
> >> "queuetimems":1134,"class":"HRegionServer","responsesize":0,"method":"multi"}
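Note the queuetimems values above: 20640 ms and 27565 ms mean those calls sat in the RPC queue for 20+ seconds before a handler even picked them up, which fits the picture of handlers stalled on a saturated disk. A quick way to quantify that across the logs, a sketch assuming the file naming shown in the excerpts:

    # Extract processing/queue times from every responseTooSlow entry and
    # show the worst queue delays.
    grep -h 'responseTooSlow' hbase-hadoop-regionserver-*.log \
      | sed -E 's/.*"processingtimems":([0-9]+).*"queuetimems":([0-9]+).*/proc_ms=\1 queue_ms=\2/' \
      | sort -t= -k3 -n | tail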
> >>
> >> On Nov 6, 2014, at 12:38 PM, Bryan Beaudreault
> >> <bbeaudrea...@hubspot.com> wrote:
> >>
> >>> blockingStoreFiles
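Bryan's one-word hint deserves spelling out: when a store accumulates more than hbase.hstore.blockingStoreFiles store files (default 7 in 0.94), updates to that region are blocked until a compaction catches up or hbase.hstore.blockingWaitTime (default 90000 ms) expires. That would explain both the write stalls and the flushes backing up. A sketch for hbase-site.xml; the values are illustrative, not recommendations:

    <!-- Tolerate more store files before blocking writes, and block for a
         shorter time. The defaults (7 files / 90000 ms) are from 0.94;
         verify against your hbase-default.xml. -->
    <property>
      <name>hbase.hstore.blockingStoreFiles</name>
      <value>20</value>
    </property>
    <property>
      <name>hbase.hstore.blockingWaitTime</name>
      <value>30000</value>
    </property>

Loosening these trades write stalls for more store files per region (and slower reads), so on a single-disk node it mostly buys time for compactions that the disk may not keep up with anyway.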