RE: Average RPC Queue Time

Vladimir Rodionov Wed, 20 Nov 2013 09:08:45 -0800

>>The RpcQueueTime metrics are a measurement of how long individual calls
>>stay in this queued state.  If your handlers were never 100% occupied, this
>>value would be 0.  An average of 3 hours is concerning, it basically means
>>that when a call comes into the RegionServer it takes on average 3 hours to
>>start processing, because handlers are all occupied for that amount of time.

Definitely, this metric value is meaningless, because the default RPC timeout
is 60 sec and under no circumstances can call data survive those 60 sec in a
callQueue unless we have a bug.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: [email protected]

________________________________________
From: Bryan Beaudreault [[email protected]]
Sent: Wednesday, November 20, 2013 8:55 AM
To: [email protected]
Subject: Re: Average RPC Queue Time

A regionserver is configured with a certain number of RPC handlers
(hbase.regionserver.handler.count).  When these handlers are all occupied,
the calls back up into a callQueue.  This call queue is bounded by
ipc.server.max.callqueue.size (defaulting to 1GB of serialized requests)
and ipc.server.max.callqueue.length (10 * numHandlers).  So, with 5
handlers a maximum of 50 calls will be queued up before requests are
rejected outright.
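
As a rough illustration of the two bounds described above, here is a minimal Python sketch. It is a toy model, not HBase code: the class and function names are hypothetical, and the only figures taken from this message are the factor of 10 calls per handler and the 1GB size cap. Rejecting a call when the queue is full follows the description above and is a simplification.

```python
from collections import deque

CALLS_PER_HANDLER = 10          # multiplier from the message: 10 * numHandlers
MAX_QUEUE_BYTES = 1024 ** 3     # 1GB of serialized requests, per the message


def max_callqueue_length(handler_count):
    """Default queue length bound as described above (hypothetical helper)."""
    return CALLS_PER_HANDLER * handler_count


class ToyCallQueue:
    """Toy model of a call queue bounded by both call count and total bytes."""

    def __init__(self, handler_count, max_bytes=MAX_QUEUE_BYTES):
        self.max_length = max_callqueue_length(handler_count)
        self.max_bytes = max_bytes
        self.queued_bytes = 0
        self.calls = deque()

    def offer(self, call_size_bytes):
        """Queue a call; return False if either bound would be exceeded."""
        if len(self.calls) >= self.max_length:
            return False
        if self.queued_bytes + call_size_bytes > self.max_bytes:
            return False
        self.calls.append(call_size_bytes)
        self.queued_bytes += call_size_bytes
        return True


queue = ToyCallQueue(handler_count=5)
accepted = sum(queue.offer(1024) for _ in range(60))
print(accepted)  # 50: the remaining 10 calls are rejected
```

With 5 handlers and no handler draining the queue, the 51st call onward is turned away, which matches the 50-call figure above.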

The RpcQueueTime metrics are a measurement of how long individual calls
stay in this queued state.  If your handlers were never 100% occupied, this
value would be 0.  An average of 3 hours is concerning; it basically means
that when a call comes into the RegionServer it takes on average 3 hours to
start processing, because handlers are all occupied for that amount of time.
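
To make the relationship between handler occupancy and queue time concrete, here is a small simulation sketch in Python. It assumes first-come, first-served handling, a fixed processing time per call, and an unbounded queue. None of these are claims about how HBase schedules calls, and the function name is hypothetical.

```python
import heapq


def average_queue_time(arrival_times, service_time, handler_count):
    """Average wait between a call's arrival and a handler picking it up.

    Calls are served first-come, first-served by a fixed pool of handlers.
    """
    if not arrival_times:
        return 0.0
    # Each entry is the time at which one handler next becomes free.
    free_at = [0.0] * handler_count
    heapq.heapify(free_at)
    total_wait = 0.0
    for arrival in sorted(arrival_times):
        earliest_free = heapq.heappop(free_at)
        start = max(arrival, earliest_free)
        total_wait += start - arrival
        heapq.heappush(free_at, start + service_time)
    return total_wait / len(arrival_times)


# One call per second, each taking 2 seconds to process.
arrivals = [float(t) for t in range(100)]
print(average_queue_time(arrivals, service_time=2.0, handler_count=5))  # 0.0
print(average_queue_time(arrivals, service_time=2.0, handler_count=1))  # 49.5
```

With 5 handlers a free one is always available, so the average wait is 0. With a single handler the backlog grows with every call, and the average wait climbs with it.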

You can lower the queue time in a few ways:

- Up the max number of handlers (beware using too many, as this just shifts
load to the disks, and also takes more memory)
- Make your requests smaller (use caching or batching on a scan to return
less data per RPC call)
- Lower your client-side timeouts, so that you can handle the issue on the
client side (i.e. retries)
- Investigate disk or network issues that could be causing extremely slow
response times (ensure data is 100% local, too)

Just for perspective, the nominal operating value of this probably varies
greatly with the workload/environment, but in our clusters we have an
Average RPC Queue Time of near 0.  We only see the callQueue fill up in the
case of real problems, and almost always respond with immediate
redistribution of data to other servers.

HTH

 - Bryan

On Wed, Nov 20, 2013 at 11:31 AM, Shawn Hermans <[email protected]> wrote:

> I am using CDH 4.3.1 with HBase 0.94.6.  Using Cloudera manager, I notice a
> metric called Average RPC Queue Time is abnormal.  It is over 3 hours
> normally and drops to a few minutes during non-peak times.  What is the
> meaning of this metric? Are these high queue times normal?
>
> Thanks,
> Shawn
>

