Thanks Todd. Better late than never indeed, appreciate it very much.

Yes, precisely: we are dealing with very spiky ingest.

The immediate issue has been addressed, though: we extended the Spark
KuduContext so we could build our own AsyncKuduClient and raise
defaultOperationTimeoutMs from the default 30s to 120s, and that has
eliminated the client timeouts.
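
For reference, the relevant part boils down to the builder call below (a
rough sketch only: the master addresses are placeholders, and how the
resulting client gets wired back into KuduContext depends on the
kudu-spark version):

  import org.apache.kudu.client.AsyncKuduClient

  // Build an AsyncKuduClient with a longer operation timeout than the
  // 30s default; our KuduContext subclass uses this client instead of
  // constructing its own.
  val masterAddresses = "kudu-master-1:7051,kudu-master-2:7051" // placeholder
  val asyncClient: AsyncKuduClient =
    new AsyncKuduClient.AsyncKuduClientBuilder(masterAddresses)
      .defaultOperationTimeoutMs(120000) // 120s instead of the 30s default
      .build()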

One follow-up question: I'm not sure I understand your comment re: low-latency
requests. If data was ingested, it is already in the MemRowSet and therefore
available to clients, so whether requests were queued or not, it should not
make a difference to data availability, right? Except maybe slowing down
scans/queries a bit, since they have to read more data from the MemRowSet and
uncompacted rowsets?

thanks again,

-m

On Mon, Apr 20, 2020 at 9:38 AM Todd Lipcon <t...@cloudera.com> wrote:

> Hi Mauricio,
>
> Sorry for the late reply on this one. Hope "better late than never" is the
> case here :)
>
> As you implied in your email, the main issue with increasing queue length
> to deal with queue overflows is that it only helps with momentary spikes.
> According to queueing theory (and intuition) if the rate of arrival of
> entries into a queue is faster than the rate of processing items in that
> queue, then the queue length will grow. If this is a transient phenomenon
> (eg a quick burst of requests) then having a larger queue capacity will
> prevent overflows, but if this is a persistent phenomenon, then there is no
> length of queue that is sufficient to prevent overflows. The one exception
> here is when the number of potential concurrent queue entries is itself
> bounded (e.g. because there is a bounded number of clients).
>
> According to the above theory, the philosophy behind the default short
> queue is that longer queues aren't a real solution if the cluster is
> overloaded. That said, if you think that the issues are just transient
> spikes rather than a capacity overload, it's possible that bumping the
> queue length (eg to 100) can help here.
>
> In terms of things to be aware of: having a longer queue means that the
> amount of memory taken by entries in the queue is increased proportionally.
> Currently, that memory is not tracked as part of Kudu's MemTracker
> infrastructure, but it does get accounted for in the global heap and can
> push the server into "memory pressure" mode where requests will start
> getting rejected, rowsets will get flushed, etc. I would recommend that if
> you increase your queues you make sure that you have a relatively larger
> memory limit allocated to your tablet servers and watch out for log
> messages and metrics indicating persistent memory pressure (particularly in
> the 80%+ range where things start getting dropped a lot).
>
> Long queues are also potentially an issue in terms of low-latency
> requests. The longer the queue (in terms of items), the longer the latency
> of elements waiting in that queue. If you have some element of latency
> SLAs, you should monitor them closely as you change queue length
> configuration.
>
> Hope that helps
>
> -Todd
>
>

-- 
Mauricio Aristizabal
Architect - Data Pipeline
mauri...@impact.com | 323 309 4260
https://impact.com
