Hi Mauricio,

Sorry for the late reply on this one. Hope "better late than never" is the case here :)

As you implied in your email, the main issue with increasing queue length to deal with queue overflows is that it only helps with momentary spikes. According to queueing theory (and intuition), if entries arrive in a queue faster than they are processed, the queue length will grow. If this is a transient phenomenon (e.g. a quick burst of requests), then a larger queue capacity will prevent overflows. If it is a persistent phenomenon, no queue length is sufficient to prevent overflows. The one exception is when the number of potential concurrent queue entries is itself bounded (e.g. because there is a bounded number of clients).

Following that theory, the philosophy behind the default short queue is that longer queues aren't a real solution if the cluster is overloaded. That said, if you think the issues are just transient spikes rather than a capacity overload, it's possible that bumping the queue length (e.g. to 100) can help here.

In terms of things to be aware of: a longer queue means that the memory taken by entries in the queue increases proportionally. Currently, that memory is not tracked as part of Kudu's MemTracker infrastructure, but it does get accounted for in the global heap and can push the server into "memory pressure" mode, where requests will start getting rejected, rowsets will get flushed, etc. If you increase your queues, I would recommend making sure you have a relatively larger memory limit allocated to your tablet servers, and watching for log messages and metrics indicating persistent memory pressure (particularly in the 80%+ range, where things start getting dropped a lot).

Long queues are also potentially an issue for low-latency requests. The longer the queue (in terms of items), the longer each element waits in that queue.
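
To make the burst vs. overload distinction concrete, here is a toy simulation. It is only a rough sketch and not Kudu code: the `simulate` function, the discrete "tick" model, and all the rates and capacities are made up for illustration.

```python
def simulate(arrivals, service_per_tick, capacity):
    """Toy bounded FIFO queue over discrete ticks (hypothetical model).

    arrivals: number of requests arriving in each tick.
    service_per_tick: requests the server can process per tick.
    capacity: maximum number of queued requests; the excess is rejected.
    Returns (rejected, peak_depth).
    """
    depth = 0
    rejected = 0
    peak = 0
    for arriving in arrivals:
        # Admit as many requests as fit; the rest overflow the queue.
        accepted = min(arriving, capacity - depth)
        rejected += arriving - accepted
        depth += accepted
        peak = max(peak, depth)
        # Drain up to service_per_tick requests.
        depth -= min(depth, service_per_tick)
    return rejected, peak


# Illustrative workloads; the server handles 10 requests per tick.
SERVICE = 10
BURST = [5] * 10 + [80] + [5] * 20   # light load with one spike
OVERLOAD = [12] * 200                # arrivals persistently exceed service

if __name__ == "__main__":
    for name, load in (("burst", BURST), ("overload", OVERLOAD)):
        for capacity in (50, 100):
            rejected, peak = simulate(load, SERVICE, capacity)
            # A request at the back of a queue of depth d waits roughly
            # d / SERVICE ticks before it is served.
            print(name, capacity, rejected, peak, peak / SERVICE)
```

In this model, going from a capacity of 50 to 100 takes the burst case from 30 rejections to none. In the overload case it only moves rejections from 360 to 310 over 200 ticks, since the queue still fills up and then overflows on every tick, and the worst-case wait doubles from about 5 ticks to about 10.
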
If you have latency SLAs, you should monitor them closely as you change the queue length configuration.

Hope that helps
-Todd