Writes stay in the memstore until the memstore limit is hit or a flush is otherwise invoked (the periodic flusher, an explicit request, etc.), so it makes sense that there is still a lot left to flush when you take the snapshot.
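If the snapshot's own flush keeps timing out, it can help to flush the table yourself first, so that by the time the snapshot runs there is little left in the memstores. A minimal sketch from the HBase shell, using the table and snapshot names from your messages below:

    # Flush all regions of the table, then snapshot it; if the flush alone
    # takes minutes on an idle table, the bottleneck is below HBase.
    hbase shell <<'EOF'
    flush 'weaver_events'
    snapshot 'weaver_events', 'backup_weaver'
    EOF

Note that the flush command returns once the request is submitted, not when the flush finishes, so give the region servers a moment (or watch the memstore size metrics) before starting the snapshot.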
To me this points to a disk write throughput issue, which makes sense since you only have one disk per node. During the normal issues you were having, are you sure you didn't see anything about blocking writes?

On Thursday, November 6, 2014, Pere Kyle <p...@whisper.sh> wrote:

> So I have another symptom which is quite odd. When trying to take a
> snapshot of the table with no writes coming in (I stopped Thrift), it
> continually times out when trying to flush (I don't believe I have the
> option of a non-flush snapshot in 0.94). Every single time I get:
>
> ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException:
> org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot {
> ss=backup_weaver table=weaver_events type=FLUSH } had an error. Procedure
> backup_weaver {
> waiting=[ip-10-227-42-142.us-west-2.compute.internal,60020,1415302661297,
> ip-10-227-42-252.us-west-2.compute.internal,60020,1415304752318,
> ip-10-231-21-106.us-west-2.compute.internal,60020,1415306503049,
> ip-10-230-130-102.us-west-2.compute.internal,60020,1415296951057,
> ip-10-231-138-119.us-west-2.compute.internal,60020,1415303920176,
> ip-10-224-53-183.us-west-2.compute.internal,60020,1415311138483,
> ip-10-250-1-140.us-west-2.compute.internal,60020,1415311984665,
> ip-10-227-40-150.us-west-2.compute.internal,60020,1415313275623,
> ip-10-231-139-198.us-west-2.compute.internal,60020,1415295324957,
> ip-10-250-77-76.us-west-2.compute.internal,60020,1415297345932,
> ip-10-248-42-35.us-west-2.compute.internal,60020,1415312717768,
> ip-10-227-45-74.us-west-2.compute.internal,60020,1415296135484,
> ip-10-227-43-49.us-west-2.compute.internal,60020,1415303176867,
> ip-10-230-130-121.us-west-2.compute.internal,60020,1415294726028,
> ip-10-224-49-168.us-west-2.compute.internal,60020,1415312488614,
> ip-10-227-0-82.us-west-2.compute.internal,60020,1415301974178,
> ip-10-224-0-167.us-west-2.compute.internal,60020,1415309549108] done=[] }
>   at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:362)
>   at org.apache.hadoop.hbase.master.HMaster.isSnapshotDone(HMaster.java:2313)
>   at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:354)
>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1434)
> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException via
> timer-java.util.Timer@239e8159:org.apache.hadoop.hbase.errorhandling.TimeoutException:
> Timeout elapsed! Source:Timeout caused Foreign Exception
> Start:1415319201016, End:1415319261016, diff:60000, max:60000 ms
>   at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:85)
>   at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:285)
>   at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:352)
>   ... 6 more
> Caused by: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout
> elapsed! Source:Timeout caused Foreign Exception Start:1415319201016,
> End:1415319261016, diff:60000, max:60000 ms
>   at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:68)
>   at java.util.TimerThread.mainLoop(Timer.java:555)
>   at java.util.TimerThread.run(Timer.java:505)
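The diff:60000, max:60000 ms in that trace is the master-side snapshot timeout expiring at its 60-second default: the flush simply isn't finishing within a minute. You can give a slow flush more room with something like the following in hbase-site.xml; these property names are from the 0.94-era snapshot code, so treat this as a sketch and verify them against the hbase-default.xml shipped with your release:

    <!-- Allow snapshot flushes more than the default 60 s. Verify these
         property names against your version's hbase-default.xml. -->
    <property>
      <name>hbase.snapshot.master.timeout.millis</name>
      <value>300000</value>
    </property>
    <property>
      <name>hbase.snapshot.region.timeout</name>
      <value>300000</value>
    </property>

That only treats the symptom, though: if flushing an idle table takes more than a minute, the disk is the real problem.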
>
> I do not have a single write coming in, so how in the world could these
> tables not be flushed? I could understand an error the first time or two,
> but how could they still not be flushed after several requests? Now I can't
> even get the data off the node to a new cluster. Any help would be greatly
> appreciated.
>
> Thanks,
> -Pere
>
> On Nov 6, 2014, at 2:09 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
>
> > One other thought: you might try tracing your requests to see where the
> > slowness happens. Recent versions of PerformanceEvaluation support this
> > feature and can be used directly or as an example for adding tracing to
> > your application.
> >
> > On Thursday, November 6, 2014, Pere Kyle <p...@whisper.sh> wrote:
> >
> >> Bryan,
> >>
> >> Thanks again for the incredibly useful reply.
> >>
> >> I have confirmed that callQueueLen is in fact 0, with a max value of 2
> >> in the last week (in Ganglia).
> >>
> >> hbase.hstore.compaction.max was set to 15 on the nodes, up from the
> >> previous 7.
> >>
> >> Freezes (laggy responses) on the cluster are frequent and affect both
> >> reads and writes. I noticed iowait spikes on the nodes.
> >>
> >> The cluster goes from working 100% to serving nothing/timeouts for no
> >> discernible reason.
> >>
> >> Looking through the logs I have tons of responseTooSlow warnings; this
> >> is the only regular occurrence in the logs:
> >>
> >> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06
> >> 03:54:31,640 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler
> >> 39 on 60020): (responseTooSlow):
> >> {"processingtimems":14573,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@c67b2ac),
> >> rpc version=1, client version=29, methodsFingerPrint=-540141542",
> >> "client":"10.231.139.198:57223","starttimems":1415246057066,
> >> "queuetimems":20640,"class":"HRegionServer","responsesize":0,"method":"multi"}
> >>
> >> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06
> >> 03:54:31,640 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler
> >> 42 on 60020): (responseTooSlow):
> >> {"processingtimems":45660,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@6c034090),
> >> rpc version=1, client version=29, methodsFingerPrint=-540141542",
> >> "client":"10.231.21.106:41126","starttimems":1415246025979,
> >> "queuetimems":202,"class":"HRegionServer","responsesize":0,"method":"multi"}
> >>
> >> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06
> >> 03:54:31,642 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler
> >> 46 on 60020): (responseTooSlow):
> >> {"processingtimems":14620,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@4fc3bb1f),
> >> rpc version=1, client version=29, methodsFingerPrint=-540141542",
> >> "client":"10.230.130.102:54068","starttimems":1415246057021,
> >> "queuetimems":27565,"class":"HRegionServer","responsesize":0,"method":"multi"}
> >>
> >> hbase-hadoop-regionserver-ip-10-230-130-121.us-west-2.compute.internal.log:2014-11-06
> >> 03:54:31,642 WARN org.apache.hadoop.ipc.HBaseServer (IPC Server handler
> >> 35 on 60020): (responseTooSlow):
> >> {"processingtimems":13431,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@3b321922),
> >> rpc version=1, client version=29, methodsFingerPrint=-540141542",
> >> "client":"10.227.42.252:60493","starttimems":1415246058210,
> >> "queuetimems":1134,"class":"HRegionServer","responsesize":0,"method":"multi"}
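Note the queuetimems values above: 20640 ms and 27565 ms mean those calls sat in the RPC queue for 20+ seconds before a handler even picked them up, which fits the picture of handlers stalled on a saturated disk. A quick way to quantify that across the logs, a sketch assuming the file naming shown in the excerpts:

    # Extract processing/queue times from every responseTooSlow entry and
    # show the worst queue delays.
    grep -h 'responseTooSlow' hbase-hadoop-regionserver-*.log \
      | sed -E 's/.*"processingtimems":([0-9]+).*"queuetimems":([0-9]+).*/proc_ms=\1 queue_ms=\2/' \
      | sort -t= -k3 -n | tail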
> >>
> >> On Nov 6, 2014, at 12:38 PM, Bryan Beaudreault
> >> <bbeaudrea...@hubspot.com> wrote:
> >>
> >>> blockingStoreFiles
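Bryan's one-word hint deserves spelling out: when a store accumulates more than hbase.hstore.blockingStoreFiles store files (default 7 in 0.94), updates to that region are blocked until a compaction catches up or hbase.hstore.blockingWaitTime (default 90000 ms) expires. That would explain both the write stalls and the flushes backing up. A sketch for hbase-site.xml; the values are illustrative, not recommendations:

    <!-- Tolerate more store files before blocking writes, and block for a
         shorter time. The defaults (7 files / 90000 ms) are from 0.94;
         verify against your hbase-default.xml. -->
    <property>
      <name>hbase.hstore.blockingStoreFiles</name>
      <value>20</value>
    </property>
    <property>
      <name>hbase.hstore.blockingWaitTime</name>
      <value>30000</value>
    </property>

Loosening these trades write stalls for more store files per region (and slower reads), so on a single-disk node it mostly buys time for compactions that the disk may not keep up with anyway.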