Some random notes, not necessarily going to help you, but:
- You probably have vnodes enabled, which means one bad node is PROBABLY
sharing replicas with almost every other node, so the fanout of its
slowness is worse than it should be, and
- You probably have speculative retry on the table set to a percentile. As
the host gets slow, the observed percentiles climb with it, speculative
retry stops firing when you need it, and you end up timing out queries.

If you change speculative retry to use the MIN(Xms, p99) syntax, with X
tuned to your real workload, you can likely force it to speculate sooner
when that one host gets sick.
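
For example (a sketch against Cassandra 4.0+, where the MIN() form of
speculative_retry was added; the keyspace/table names and the 50ms bound
are placeholders you'd tune to your workload):

    ALTER TABLE my_ks.my_table
      WITH speculative_retry = 'MIN(99p,50ms)';

With that in place the coordinator sends a redundant read to another
replica after min(the table's observed p99, 50ms), so the threshold stays
capped even as the sick host drags the percentiles up.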

The harder thing to solve is a bad coordinator node slowing down all reads
coordinated by that node. Retrying at the client level against a different
coordinator tends to be an effective workaround.
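
For example, with the DataStax Java driver 3.x (a sketch; the contact
point, the 100ms delay, and the cap of 2 speculative executions are
made-up values, and the driver will only speculate on statements you've
marked idempotent):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.ConstantSpeculativeExecutionPolicy;

    Cluster cluster = Cluster.builder()
        .addContactPoint("10.0.0.1")
        .withSpeculativeExecutionPolicy(
            // After 100ms with no response, re-send the query to the
            // next host in the query plan, allowing at most 2 speculative
            // executions on top of the original request.
            new ConstantSpeculativeExecutionPolicy(100, 2))
        .build();

Because the retry goes to a different coordinator, a single sick node
stops dominating your tail latency.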



On Wed, Oct 13, 2021 at 2:22 PM S G <sg.online.em...@gmail.com> wrote:

> Hello,
>
> We have frequently seen that a single bad node running slow can affect the
> latencies of the entire cluster (especially for queries where the slow node
> was acting as a coordinator).
>
>
> Is there any suggestion to avoid this behavior?
>
> Like something on the client side to not query that bad node or something
> on the bad node that redirects its query to other healthy coordinators?
>
>
> Thanks,
>
