Hi,

I've checked my 30 RPC handlers: they are all in a WAITING state. Here is an extract for one of our RS (it is similar for all of them):

requestsPerSecond=593, numberOfOnlineRegions=584, numberOfStores=1147, numberOfStorefiles=1980, storefileIndexSizeMB=15, rootIndexSizeKB=16219, totalStaticIndexSizeKB=246127, totalStaticBloomSizeKB=12936, memstoreSizeMB=1421, readRequestsCount=633241097, writeRequestsCount=9375846, compactionQueueSize=0, flushQueueSize=0, usedHeapMB=3042, maxHeapMB=4591, blockCacheSizeMB=890.19, blockCacheFreeMB=257.65, blockCacheCount=14048, blockCacheHitCount=5854936149, blockCacheMissCount=14761288, blockCacheEvictedCount=4870523, blockCacheHitRatio=99%, blockCacheHitCachingRatio=99%, hdfsBlocksLocalityIndex=29
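A minimal sketch of one way to check handler thread state programmatically, via the standard ThreadMXBean. It only sees the JVM it runs in, so against a remote region server you would attach through a JMX connector or simply run jstack on the RS pid; the thread-name prefix matches the "IPC Server handler N on 60020" threads visible in the log further down. The class name is a placeholder, not part of HBase.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Sketch: print the state of each RPC handler thread in this JVM.
// Handler threads are named "IPC Server handler <n> on <port>".
public class HandlerStateCheck {
  public static void main(String[] args) {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
      if (info != null && info.getThreadName().startsWith("IPC Server handler")) {
        System.out.println(info.getThreadName() + " -> " + info.getThreadState());
      }
    }
  }
}

All handlers sitting in WAITING, together with a 99% block cache hit ratio and ~10% CPU, suggests the region servers themselves are mostly idle while the requests are stuck.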
On 21/11/12 05:53, Alok Singh wrote:

Do your PUTs and GETs have small amounts of data? If yes, then you can increase the number of handlers. We have an 8-node cluster on 0.92.1, and these are some of the settings we changed from 0.90.4:
hbase.regionserver.handler.count = 150
hbase.hregion.max.filesize = 2147483648 (2 GB)
The region servers are run with a 16 GB heap (-Xmx16000M). With these settings, at peak we can handle ~2K concurrent clients.

Alok

On Tue, Nov 20, 2012 at 8:21 AM, Vincent Barat <[email protected]> wrote:

Hi, We have changed some parameters on our 16 (!) region servers: 1 GB more -Xmx, more RPC handlers (from 10 to 30), longer timeouts, but nothing seems to improve the response time:
- Scans with HBase 0.92 are 3x SLOWER than with HBase 0.90.3
- A lot of simultaneous gets leads to a huge slowdown of batch put & random read response times
... despite the fact that our RS CPU load is really low (10%). Note: we have not (yet) activated MSLAB, nor direct reads on HDFS. Any idea please? I'm really stuck on that issue. Best regards,

On 16/11/12 20:55, Vincent Barat wrote:

Hi, Right now (and previously with 0.90.3) we were using the default value (10). We are trying right now to increase it to 30 to see if it is better. Thanks for your concern.

On 16/11/12 18:13, Ted Yu wrote:

Vincent: What's the value for hbase.regionserver.handler.count? I assume you kept the same value as in 0.90.3. Thanks

On Fri, Nov 16, 2012 at 8:14 AM, Vincent Barat <[email protected]> wrote:

On 16/11/12 01:56, Stack wrote:

On Thu, Nov 15, 2012 at 5:21 AM, Guillaume Perrot <[email protected]> wrote:

It happens when several tables are being compacted and/or when there are several scanners running.

Does it happen for a particular region? Anything you can tell about the server from your cluster monitoring? Is it running hot? What do the HBase region server stats in the UI say? Anything interesting about compaction queues or requests?

Hi, thanks for your answer Stack. I will take the lead on that thread from now on. It does not happen on any particular region. Actually, things are getting better now that compactions have been performed on all tables and have stopped. Nevertheless, we face a dramatic decrease in performance (especially on random gets) across the whole cluster: despite the fact that we doubled our number of region servers (from 8 to 16), and despite these region servers' CPU load being only about 10% to 30%, performance is really bad: very often a slight increase in requests leads to clients locked on their requests and very long response times. It looks like a contention / deadlock somewhere in the HBase client code.

If you look at the thread dump, are all handlers occupied serving requests? These timed-out requests couldn't get into the server?

We will investigate that and report to you.
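To make "batch put & random read" concrete, here is a minimal sketch of that access pattern with the 0.92-era client API. The table name "t" and the column family/qualifier "cf"/"q" are placeholders, not values from this thread; the buffered puts are sent to the region servers as the multi(...) RPCs that appear in the log below.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of a batched random-read plus batched-put workload (placeholder names).
public class BatchWorkloadSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "t");
    try {
      // Random reads, batched: the client groups the Gets per region server.
      List<Get> gets = new ArrayList<Get>();
      for (int i = 0; i < 1000; i++) {
        gets.add(new Get(Bytes.toBytes("row-" + i)));
      }
      Result[] results = table.get(gets);
      System.out.println("fetched " + results.length + " rows");

      // Batched puts: buffered client side, flushed as multi(...) RPCs.
      List<Put> puts = new ArrayList<Put>();
      for (int i = 0; i < 1000; i++) {
        Put p = new Put(Bytes.toBytes("row-" + i));
        p.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
        puts.add(p);
      }
      table.put(puts);
    } finally {
      table.close();
    }
  }
}

Whether calls like these stall in the client or in the server handlers is exactly what the thread dumps discussed above should show.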
Before the timeouts, we observe an increasing CPU load on a single region server, and if we add region servers and wait for rebalancing, we always have the same region server causing problems like these:

2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@2c3da1aa), rpc version=1, client version=29, methodsFingerPrint=54742778 from <ip>:45334: output error
2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60020 caught: java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
        at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1653)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:924)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:1003)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Call.sendResponseIfReady(HBaseServer.java:409)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1346)

With the same access patterns, we did not have this issue in HBase 0.90.3.

The above is the other side of the timeout -- the client is gone. Can you explain the rising CPU?

--
Vincent Barat CTO
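The ClosedChannelException above is the region server trying to answer a client that has already timed out and closed its socket. A minimal sketch, assuming the default client-side timeout and back-off are too tight for this workload, of widening them; the values are illustrative only, not taken from the thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Sketch: widen client-side RPC timeout and retry back-off so slow calls can
// still complete instead of being abandoned (which surfaces on the RS as
// ClosedChannelException). Values are examples only.
public class ClientTimeoutConfig {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.rpc.timeout", 120000);        // per-call timeout in ms (default 60000)
    conf.setInt("hbase.client.retries.number", 10);  // retries before giving up
    conf.setLong("hbase.client.pause", 1000);        // base back-off between retries, ms
    return conf;
  }
}

Widening timeouts only hides the ClosedChannelException noise, though; the underlying slowness (here, the single hot region server) still needs to be explained.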
- Lots of SocketTimeoutException for gets and puts since HB... Guillaume Perrot
- Re: Lots of SocketTimeoutException for gets and puts... Stack
- Re: Lots of SocketTimeoutException for gets and ... Vincent Barat
- Re: Lots of SocketTimeoutException for gets ... Ted Yu
- Re: Lots of SocketTimeoutException for g... Vincent Barat
- increasing block cache size Ted Tuttle
- Re: Lots of SocketTimeoutException for g... Vincent Barat
- X3 slow down after moving from HBas... Vincent Barat
- Re: X3 slow down after moving f... Alok Singh
- Re: X3 slow down after movi... Vincent Barat
- Re: X3 slow down after movi... Vincent Barat
- Re: X3 slow down after movi... Stack
- Re: X3 slow down after movi... Vincent Barat
- Re: X3 slow down after movi... Vincent Barat
- Re: X3 slow down after moving f... Stack
- Re: X3 slow down after movi... Vincent Barat
- Re: HBase scanner LeaseExce... Vincent Barat
