Hi Mauricio,

Sorry for the late reply on this one. Hope "better late than never" is the
case here :)

As you implied in your email, the main issue with increasing queue length
to deal with queue overflows is that it only helps with momentary spikes.
According to queueing theory (and intuition), if the rate of arrival of
entries into a queue is faster than the rate at which items are processed
out of it, then the queue length will grow. If this is a transient
phenomenon (eg a quick burst of requests), then a larger queue capacity
will prevent overflows, but if it is a persistent phenomenon, then no queue
length is sufficient to prevent overflows. The one exception is when the
number of potential concurrent queue entries is itself bounded (eg because
there is a bounded number of clients).
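To make that concrete, here's a toy discrete-time sketch (not Kudu code, just the queueing-theory point, with made-up rates): under persistent overload the depth grows without bound, while a burst followed by slack capacity drains back to zero.

```python
# Toy model: queue depth over time for constant arrival/service rates.
def simulate(arrivals_per_tick, served_per_tick, ticks):
    """Return queue depth after each tick."""
    depth, depths = 0, []
    for _ in range(ticks):
        depth = max(0, depth + arrivals_per_tick - served_per_tick)
        depths.append(depth)
    return depths

# Same model, but arrivals spike for the first few ticks then drop below
# the service rate (a "transient burst").
def simulate_burst(burst_rate, steady_rate, served_per_tick, ticks):
    depth, depths = 0, []
    for t in range(ticks):
        arrivals = burst_rate if t < 5 else steady_rate
        depth = max(0, depth + arrivals - served_per_tick)
        depths.append(depth)
    return depths

overloaded = simulate(12, 10, 100)        # grows ~2 entries per tick, forever
bursty = simulate_burst(30, 5, 10, 100)   # peaks, then drains back to 0
```

No finite queue capacity saves the first case; a modestly larger one absorbs the second.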

Given the above, the philosophy behind the default short queue is that
longer queues aren't a real solution if the cluster is overloaded. That
said, if you think the issues are just transient spikes rather than a
capacity overload, it's possible that bumping the queue length (eg to 100)
can help here.
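For reference, if memory serves the relevant gflag is rpc_service_queue_length (default 50) — do double-check the name and default against the flag docs for your Kudu version before relying on it:

```shell
# Assumed flag name; verify with `kudu-tserver --helpfull | grep queue`.
kudu-tserver --rpc_service_queue_length=100 <other flags>
```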

In terms of things to be aware of: having a longer queue means that the
amount of memory taken by queued entries increases proportionally.
Currently, that memory is not tracked as part of Kudu's MemTracker
infrastructure, but it does get accounted for in the global heap and can
push the server into "memory pressure" mode, where requests will start
getting rejected, rowsets will get flushed, etc. If you increase your
queues, I'd recommend making sure you allocate a relatively larger memory
limit to your tablet servers, and watching for log messages and metrics
indicating persistent memory pressure (particularly in the 80%+ range,
where things start getting dropped a lot).
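As a back-of-envelope for that proportional memory cost (the numbers below are illustrative assumptions, not measured Kudu values — actual entry sizes depend on your workload's request sizes):

```python
# Rough worst-case estimate of untracked queue memory: every slot full of
# a large request. All figures here are hypothetical.
queue_length = 100                    # the bumped queue length
max_request_bytes = 8 * 1024 * 1024   # e.g. a large write batch (~8 MiB)

worst_case_bytes = queue_length * max_request_bytes
print(worst_case_bytes / (1024 ** 3))  # → 0.78125 (GiB)
```

That headroom is what you'd want to fold into the tablet server's memory limit.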

Long queues are also a potential issue for low-latency requests. The longer
the queue (in terms of items), the longer elements wait in it before being
served. If you have any latency SLAs, you should monitor them closely as
you change the queue length configuration.
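The rough relationship (a standard queueing estimate, with an assumed service rate): an item that enters behind N others waits about N divided by the service rate, so doubling a full queue roughly doubles the tail latency.

```python
# Expected wait for an item with `items_ahead` entries in front of it,
# given a (hypothetical) steady service rate in items/second.
def expected_wait_seconds(items_ahead, served_per_second):
    return items_ahead / served_per_second

print(expected_wait_seconds(50, 1000))   # → 0.05  (50 ms behind a full default-sized queue)
print(expected_wait_seconds(100, 1000))  # → 0.1   (100 ms behind a full bumped queue)
```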

Hope that helps

-Todd
