Vincent: What's the value for hbase.regionserver.handler.count? I assume you kept the same value as in 0.90.3.

Thanks
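(For context: hbase.regionserver.handler.count is read from the region server's hbase-site.xml. Below is a minimal sketch of checking what value a given configuration resolves to, assuming the HBase jars and the server's hbase-site.xml are on the classpath; the fallback of 10 matches the shipped 0.90/0.92 default, not necessarily this cluster's setting.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class HandlerCountCheck {
      public static void main(String[] args) {
        // Loads hbase-default.xml plus any hbase-site.xml found on the classpath.
        Configuration conf = HBaseConfiguration.create();
        // Falls back to 10 (the value shipped in the 0.90/0.92 hbase-default.xml)
        // when the property is not overridden.
        int handlers = conf.getInt("hbase.regionserver.handler.count", 10);
        System.out.println("hbase.regionserver.handler.count = " + handlers);
      }
    }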
On Fri, Nov 16, 2012 at 8:14 AM, Vincent Barat <[email protected]> wrote:

> On 16/11/12 01:56, Stack wrote:
>> On Thu, Nov 15, 2012 at 5:21 AM, Guillaume Perrot <[email protected]> wrote:
>>> It happens when several tables are being compacted and/or when there are
>>> several scanners running.
>>
>> Does it happen for a particular region? Anything you can tell about the
>> server from your cluster monitoring? Is it running hot? What do the hbase
>> regionserver stats in the UI say? Anything interesting about compaction
>> queues or requests?
>
> Hi, thanks for your answer Stack. I will take the lead on this thread from
> now on.
>
> It does not happen on any particular region. Actually, things are better
> now that compactions have been performed on all tables and have stopped.
>
> Nevertheless, we are facing a dramatic decrease in performance (especially
> on random gets) across the overall cluster.
>
> Despite the fact that we doubled our number of region servers (from 8 to
> 16), and despite the fact that these region servers' CPU load is only
> about 10% to 30%, performance is really bad: very often a slight increase
> in requests leads to clients blocked on a request and very long response
> times. It looks like a contention / deadlock somewhere in the HBase client
> and RPC code.
>
>> If you look at the thread dump, are all handlers occupied serving
>> requests? These timed-out requests couldn't get into the server?
>
> We will investigate that and report back to you.
>
>>> Before the timeouts, we observe an increasing CPU load on a single
>>> region server, and if we add region servers and wait for rebalancing, we
>>> always have the same region server causing problems like these:
>>>
>>> 2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC
>>> Server Responder, call
>>> multi(org.apache.hadoop.hbase.client.MultiAction@2c3da1aa), rpc
>>> version=1, client version=29, methodsFingerPrint=54742778 from
>>> <ip>:45334: output error
>>> 2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC
>>> Server handler 3 on 60020 caught: java.nio.channels.ClosedChannelException
>>>   at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>>>   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>>>   at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1653)
>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:924)
>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:1003)
>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Call.sendResponseIfReady(HBaseServer.java:409)
>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1346)
>>>
>>> With the same access patterns, we did not have this issue in HBase
>>> 0.90.3.
>>
>> The above is the other side of the timeout -- the client is gone.
>>
>> Can you explain the rising CPU?
>
> No, there is no explanation (no heavy access to a given region, for
> example). But this specific problem went away when we finished the
> compactions.
>
>> Is it iowait on this box because of compactions? Bad disk? Always the
>> same regionserver, or does the issue move around?
>>
>> Sorry for all the questions. 0.92 should be better than 0.90
>
> Our experience is currently the exact opposite: for us, 0.92 seems to be
> much slower than 0.90.3.
>
>> generally (0.94 even better still -- can you go there?).
>
> We can go to 0.94, but unfortunately we CANNOT GO BACK (the same way we
> cannot go back to 0.90.3, since there is apparently a change in the format
> of the ROOT table).
> The upgrade works, but the downgrade does not. And we are afraid of having
> even more "new" problems with 0.94 and being forced to roll back to 0.90.3
> (with some days of data loss).
>
> Thanks for your reply; we will continue to investigate.
>
>> Interesting that these issues show up post upgrade. I can't think of a
>> reason why the different versions would bring this on...
>>
>> St.Ack
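(Side note while the thread dumps are being collected: since the gets time out while the region servers sit at only 10-30% CPU, the client-side RPC knobs are worth ruling out. A minimal sketch using 0.92-era property names; the values are illustrative placeholders rather than this cluster's actual settings, and "mytable" / "some-row" are hypothetical.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ClientTimeoutSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.setInt("hbase.rpc.timeout", 60000);        // per-RPC timeout, in ms
        conf.setInt("hbase.client.retries.number", 10); // retries before the client gives up
        conf.setLong("hbase.client.pause", 1000);       // base back-off between retries, in ms

        HTable table = new HTable(conf, "mytable");     // hypothetical table name
        try {
          Result r = table.get(new Get(Bytes.toBytes("some-row"))); // hypothetical row key
          System.out.println("got " + r.size() + " key-values");
        } finally {
          table.close();
        }
      }
    }

If the thread dumps show the handlers really are all busy, these client settings only change how long callers wait; the server-side handler count and the compaction load are the knobs that matter.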
