On Tue, Nov 20, 2012 at 8:21 AM, Vincent Barat <[email protected]> wrote:
> We have changed some parameters on our 16(!) region servers: 1 GB more -Xmx,
> more RPC handlers (from 10 to 30) and a longer timeout, but nothing seems to
> improve the response time:
>
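A side note on the knobs just mentioned: the handler count and RPC timeout normally
live in hbase-site.xml on each regionserver, and the extra heap goes into hbase-env.sh.
The Java snippet below is only a sketch of the key names involved, with example values;
it is not the exact configuration used on this cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class TuningKeysSketch {
  public static void main(String[] args) {
    // Sketch only: in practice these belong in hbase-site.xml on the servers
    // (and the extra 1 GB of -Xmx in HBASE_HEAPSIZE in hbase-env.sh).
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.regionserver.handler.count", 30); // raised from the old default of 10
    conf.setInt("hbase.rpc.timeout", 120000);            // example: 2-minute RPC timeout, in ms
    System.out.println("handlers = "
        + conf.getInt("hbase.regionserver.handler.count", 10));
  }
}
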
Have you taken a look at the perf chapter, Vincent?
http://hbase.apache.org/book.html#performance
Did you carry forward your old hbase-default.xml, or did you remove it
(0.92 should have defaults in hbase-X.X.X.jar -- some defaults will have changed)?

> - Scans with HBase 0.92 are x3 SLOWER than with HBase 0.90.3

Any scan caching going on?

> - A lot of simultaneous gets lead to a huge slow down of batch put &
> random read response time

The gets are returning lots of data? (If you thread dump the server at this
time -- see the link at the top of the regionserver UI -- can you see what we
are hung up on? Are all handlers occupied?)

> ... despite the fact that our RS CPU load is really low (10%)

As has been suggested earlier, perhaps up the handlers?

> Note: we have not (yet) activated MSLAB, nor direct read on HDFS.

MSLAB will help you avoid stop-the-world GCs. Direct read of HDFS should speed
up random access.

St.Ack

> Any idea, please? I'm really stuck on that issue.
>
> Best regards,
>
> On 16/11/12 20:55, Vincent Barat wrote:
>>
>> Hi,
>>
>> Right now (and previously with 0.90.3) we were using the default value
>> (10). We are trying right now to increase it to 30 to see if it is better.
>>
>> Thanks for your concern.
>>
>> On 16/11/12 18:13, Ted Yu wrote:
>>>
>>> Vincent:
>>> What's the value for hbase.regionserver.handler.count?
>>>
>>> I assume you kept the same value as that from 0.90.3.
>>>
>>> Thanks
>>>
>>> On Fri, Nov 16, 2012 at 8:14 AM, Vincent Barat <[email protected]> wrote:
>>>
>>>> On 16/11/12 01:56, Stack wrote:
>>>>
>>>>> On Thu, Nov 15, 2012 at 5:21 AM, Guillaume Perrot <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> It happens when several tables are being compacted and/or when there
>>>>>> are several scanners running.
>>>>>
>>>>> It happens for a particular region? Anything you can tell about the
>>>>> server looking in your cluster monitoring? Is it running hot? What do
>>>>> the hbase regionserver stats in the UI say? Anything interesting about
>>>>> compaction queues or requests?
>>>>
>>>> Hi, thanks for your answer Stack. I will take the lead on that thread
>>>> from now on.
>>>>
>>>> It does not happen on any particular region. Actually, things are getting
>>>> better now since compactions have been performed on all tables and have
>>>> been stopped.
>>>>
>>>> Nevertheless, we face a dramatic decrease in performance (especially on
>>>> random gets) of the overall cluster:
>>>>
>>>> Despite the fact that we doubled our number of region servers (from 8 to
>>>> 16), and despite the fact that these region servers' CPU load is only
>>>> about 10% to 30%, performance is really bad: very often a slight increase
>>>> in requests leads to clients blocked on their requests and very long
>>>> response times. It looks like a contention / deadlock somewhere in the
>>>> HBase client or server code.
>>>>
>>>>> If you look at the thread dump, are all handlers occupied serving
>>>>> requests? These timed-out requests couldn't get into the server?
>>>>
>>>> We will investigate on that and report to you.
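To the scan-caching question above: if the default hbase.client.scanner.caching
(1 in that era) is still in effect, a scanner makes one RPC round-trip per row,
which by itself can make scans look several times slower than they should be.
A minimal client-side sketch, assuming the 0.92 API; the table name is a
placeholder, and the values are only examples:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ScanCachingSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");   // "mytable" is a placeholder name
    Scan scan = new Scan();
    scan.setCaching(500);        // rows fetched per RPC instead of the tiny default
    scan.setCacheBlocks(false);  // optional: keep a big scan from churning the block cache
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // process each row here
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}

The MSLAB and direct-read settings Stack mentions are server-side; if I recall
the key names correctly, they are hbase.hregion.memstore.mslab.enabled for MSLAB
and the HDFS short-circuit read setting dfs.client.read.shortcircuit.
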
>>>>
>>>>>> Before the timeouts, we observe an increasing CPU load on a single
>>>>>> region server, and if we add region servers and wait for rebalancing,
>>>>>> we always have the same region server causing problems like these:
>>>>>>
>>>>>> 2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC
>>>>>> Server Responder, call
>>>>>> multi(org.apache.hadoop.hbase.client.MultiAction@2c3da1aa), rpc
>>>>>> version=1, client version=29, methodsFingerPrint=54742778 from
>>>>>> <ip>:45334: output error
>>>>>> 2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC
>>>>>> Server handler 3 on 60020 caught: java.nio.channels.ClosedChannelException
>>>>>> at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>>>>>> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>>>>>> at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1653)
>>>>>> at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:924)
>>>>>> at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:1003)
>>>>>> at org.apache.hadoop.hbase.ipc.HBaseServer$Call.sendResponseIfReady(HBaseServer.java:409)
>>>>>> at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1346)
>>>>>>
>>>>>> With the same access patterns, we did not have this issue in HBase
>>>>>> 0.90.3.
>>>>>
>>>>> The above is the other side of the timeout -- the client is gone.
>>>>>
>>>>> Can you explain the rising CPU?
>>>>
>>>> No, there is no explanation (no unusually high access to a given region,
>>>> for example). But this specific problem went away when we finished
>>>> compactions.
>>>>
>>>>> Is it iowait on this box because of compactions? Bad disk? Always the
>>>>> same regionserver, or does the issue move around?
>>>>>
>>>>> Sorry for all the questions. 0.92 should be better than 0.90 generally
>>>>> (0.94 even better still -- can you go there?).
>>>>
>>>> Our experience is currently the exact opposite: for us, 0.92 seems to be
>>>> about three times slower than 0.90.3.
>>>>
>>>> We can go to 0.94 but unfortunately, we CANNOT GO BACK (the same way we
>>>> cannot go back to 0.90.3, since there is apparently a modification of the
>>>> format of the ROOT table). The upgrade works, but the downgrade does not.
>>>> And we are afraid of having even more "new" problems with 0.94 and being
>>>> forced to roll back to 0.90.3 (with some days of data loss).
>>>>
>>>> Thanks for your reply; we will continue to investigate.
>>>>
>>>>> Interesting that these issues show up post upgrade. I can't think of a
>>>>> reason why the different versions would bring this on...
>>>>>
>>>>> St.Ack
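One more note on the ClosedChannelException quoted above: it means the client had
already given up and closed the connection by the time the server tried to write
the multi() response, so it can help to raise the client RPC timeout and to keep
write batches small enough that a single call comfortably finishes within it.
A rough client-side sketch, assuming the 0.92 API; the table, family and qualifier
names and all sizes are placeholders, not the settings used on this cluster:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchPutSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.rpc.timeout", 120000);       // example: give large multi() calls 2 minutes

    HTable table = new HTable(conf, "mytable");     // placeholder table name
    table.setAutoFlush(false);                      // buffer puts on the client side
    table.setWriteBufferSize(2 * 1024 * 1024);      // flush in ~2 MB batches rather than huge ones

    for (int i = 0; i < 10000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      // "f" and "q" are placeholder family/qualifier names
      put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
      table.put(put);                               // buffered; flushed when the buffer fills
    }
    table.flushCommits();                           // push anything still buffered
    table.close();
  }
}
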
