Hi,

Unfortunately, the numbers you're posting have no meaning without context.
The Speculative retries value in tablestats is a cumulative counter, so it
will always rise.  The retries could be the cause of a problem, or you could
simply be executing enough queries, with a fairly high variance in latency,
that they're triggered often.  It's unclear how many queries per second
you're executing, and there's no historical information to say whether what
you're seeing now is an anomaly or business as usual.
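
For example, a quick way to get a feel for the read rate on a node (a
rough sketch; the keyspace, table, and auth flags are placeholders) is to
sample the cumulative read count twice and divide by the interval:

  # Sample the cumulative "Local read count" 60s apart and compute the delta.
  before=$(nodetool -u <username> -pw <pwd> -p <port> tablestats <keyspace>.<table> | awk '/Local read count/ {print $4}')
  sleep 60
  after=$(nodetool -u <username> -pw <pwd> -p <port> tablestats <keyspace>.<table> | awk '/Local read count/ {print $4}')
  echo "reads/sec on this node: $(( (after - before) / 60 ))"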

If you want to test your theory that speculative retries are causing your
performance issue, you could try changing speculative retry to a fixed
value instead of a percentile, such as 50ms.  It's easy enough to try, and
you'll have an answer to your question almost immediately.
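
For example (a minimal sketch; keyspace, table, and credentials are
placeholders), you can pin the suspect table to a fixed threshold and
revert once you've measured the effect:

  # Pin speculative retry to a fixed 50ms threshold on the suspect table:
  cqlsh -u <username> -p <pwd> -e "ALTER TABLE <keyspace>.<table> WITH speculative_retry = '50ms';"
  # ...measure, then revert to the percentile default:
  cqlsh -u <username> -p <pwd> -e "ALTER TABLE <keyspace>.<table> WITH speculative_retry = '99p';"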

The problem with this approach is that you're essentially guessing based on
very limited information: the output of a nodetool command you've run "every
few secs".  I prefer a more data-driven approach.  Get a CPU flame graph and
figure out where your time is spent:
https://rustyrazorblade.com/post/2023/2023-11-07-async-profiler/
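
The mechanics look roughly like this, assuming async-profiler 2.x unpacked
on the node, where the launcher script is profiler.sh (the post walks
through the details):

  # Sample CPU for 60s on the Cassandra JVM and write an HTML flame graph.
  ./profiler.sh -e cpu -d 60 -f /tmp/cassandra-cpu-flame.html $(pgrep -f CassandraDaemon)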

The flame graph will show you exactly which code paths dominate, and you can
focus on improving those, rather than a statistic you've picked more or less
at random.

I just gave a talk at SCALE on distributed systems performance
troubleshooting.  You'll be better off following a methodical process than
guessing at potential root causes, because the odds of correctly guessing
the root cause in a system this complex are close to zero.  My talk is
here: https://www.youtube.com/watch?v=VX9tHk3VTLE

I'm guessing you don't have dashboards in place if you're relying on
nodetool output with grep.  If your cluster has fewer than 6 nodes, you can
take advantage of AxonOps's free tier: https://axonops.com/

Good dashboards are essential for these types of problems.

Jon



On Sat, Mar 30, 2024 at 2:33 AM ranju goel <goel.ra...@gmail.com> wrote:

> Hi All,
>
> While debugging the cluster for the performance dip seen on 4.1.4, I
> found a high Speculative retries value in nodetool tablestats during read
> operations.
>
> I ran the tablestats command below and checked its output after every few
> secs, and noticed that the retries keep rising.  There is also an open
> ticket (https://issues.apache.org/jira/browse/CASSANDRA-18766) that looks
> similar to this.
> /usr/share/cassandra/bin/nodetool -u <username> -pw <pwd> -p <port>
> tablestats <keyspace> | grep -i 'Speculative retries'
>
>                 Speculative retries: 11633
>                 ..
>                 ..
>                 Speculative retries: 13727
>                 Speculative retries: 14256
>                 Speculative retries: 14855
>                 Speculative retries: 14858
>                 Speculative retries: 14859
>                 Speculative retries: 14873
>                 Speculative retries: 14875
>                 Speculative retries: 14890
>                 Speculative retries: 14893
>                 Speculative retries: 14896
>                 Speculative retries: 14901
>                 Speculative retries: 14905
>                 Speculative retries: 14946
>                 Speculative retries: 14948
>                 Speculative retries: 14957
>
> I suspect this could be the cause of the performance dip.  Please chime in
> if anyone knows more about it.
>
>
> Regards
>
> On Wed, Mar 27, 2024 at 10:43 PM Subroto Barua via user <
> user@cassandra.apache.org> wrote:
>
>> We are seeing similar perf issues with counter writes.  To reproduce:
>>
>> cassandra-stress counter_write n=100000 no-warmup cl=LOCAL_QUORUM -rate
>> threads=50 -mode native cql3 user=<user> password=<pw> -name <cluster_name>
>>
>>
>> op rate: 39,260 ops (4.1) and 63,689 ops (4.0)
>> latency 99th percentile: 7.7ms (4.1) and 1.8ms (4.0)
>> Total GC count: 750 (4.1) and 744 (4.0)
>> Avg GC time: 106 ms (4.1) and 110.1 ms (4.0)
>>
>>
>> On Wednesday, March 27, 2024 at 12:18:50 AM PDT, ranju goel <
>> goel.ra...@gmail.com> wrote:
>>
>>
>> Hi All,
>>
>> I was going through this mail chain
>> (https://www.mail-archive.com/user@cassandra.apache.org/msg63564.html)
>> and was wondering if this could cause a performance degradation in 4.1
>> without changing compactionThroughput.
>>
>> We are seeing a performance dip in Read/Write after upgrading from 4.0
>> to 4.1.
>>
>> Regards
>> Ranju
>>
>
