Thanks Todd. Better late than never indeed, much appreciated. Yes, precisely: we are dealing with very spiky ingest.
The immediate issue has been addressed, though: we extended the Spark KuduContext so we could build our own AsyncKuduClient and raise defaultOperationTimeoutMs from the 30s default to 120s, and that has eliminated the client timeouts (rough sketch appended below).

One follow-up question: I'm not sure I understand your comment re: low-latency requests. If data was ingested, it is already in the MemRowSet and therefore available to clients, so whether requests sat in the queue or not should make no difference to data availability, right? Except maybe scans/queries get a bit slower since they have to read more data from the MemRowSet and uncompacted RowSets.

thanks again,

-m

On Mon, Apr 20, 2020 at 9:38 AM Todd Lipcon <t...@cloudera.com> wrote:

> Hi Mauricio,
>
> Sorry for the late reply on this one. Hope "better late than never" is
> the case here :)
>
> As you implied in your email, the main issue with increasing queue length
> to deal with queue overflows is that it only helps with momentary spikes.
> According to queueing theory (and intuition), if the rate of arrival of
> entries into a queue is faster than the rate at which items are processed,
> then the queue length will grow. If this is a transient phenomenon (e.g. a
> quick burst of requests), then a larger queue capacity will prevent
> overflows, but if it is a persistent phenomenon, then no queue length is
> sufficient to prevent overflows. The one exception is when the number of
> potential concurrent queue entries is itself bounded (e.g. because there
> is a bounded number of clients).
>
> Following the above reasoning, the philosophy behind the default short
> queue is that longer queues aren't a real solution if the cluster is
> overloaded. That said, if you think the issues are just transient spikes
> rather than a capacity overload, it's possible that bumping the queue
> length (e.g. to 100) can help here.
>
> In terms of things to be aware of: a longer queue means that the amount
> of memory taken by entries in the queue increases proportionally.
> Currently, that memory is not tracked as part of Kudu's MemTracker
> infrastructure, but it does get accounted for in the global heap and can
> push the server into "memory pressure" mode, where requests start getting
> rejected, rowsets get flushed, etc. If you increase your queues, I would
> recommend making sure you have a relatively larger memory limit allocated
> to your tablet servers, and watching for log messages and metrics
> indicating persistent memory pressure (particularly in the 80%+ range,
> where things start getting dropped a lot).
>
> Long queues are also a potential issue for low-latency requests: the
> longer the queue (in terms of items), the longer the latency of elements
> waiting in it. If you have any latency SLAs, you should monitor them
> closely as you change the queue length configuration.
>
> Hope that helps
>
> -Todd

--
Mauricio Aristizabal
Architect - Data Pipeline
mauri...@impact.com | 323 309 4260
https://impact.com
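
P.S. In case it helps anyone else, here is a rough sketch of the client-side change. It is not our exact subclass (how the client gets wired into KuduContext depends on version-specific internals), and the master addresses are placeholders; the relevant knob is just the builder's defaultOperationTimeoutMs:

    import org.apache.kudu.client.{AsyncKuduClient, KuduClient}

    val kuduMasters = "kudu-master-1:7051,kudu-master-2:7051"  // placeholder addresses

    // async client with the operation timeout raised from the 30s default to 120s
    val asyncClient: AsyncKuduClient =
      new AsyncKuduClient.AsyncKuduClientBuilder(kuduMasters)
        .defaultOperationTimeoutMs(120000L)
        .build()

    // same knob on the synchronous client, if that's the path your job uses
    val syncClient: KuduClient =
      new KuduClient.KuduClientBuilder(kuduMasters)
        .defaultOperationTimeoutMs(120000L)
        .build()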
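
P.P.S. On the server side, I believe the knobs Todd's advice maps to are the tablet server gflags below (values are only illustrative, and the defaults are worth double-checking against your Kudu version's docs):

    # tserver gflagfile
    --rpc_service_queue_length=100         # default is 50; the "bump to 100" suggestion
    --memory_limit_hard_bytes=17179869184  # e.g. 16 GiB of headroom for the tserver
    --memory_limit_soft_percentage=80      # rejections ramp up past this fraction of the hard limit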