Hi,

Unfortunately, the numbers you're posting have no meaning without context. The speculative retries could be the cause of a problem, or you could simply be executing enough queries, with a high enough variance in latency, that they trigger often. It's unclear how many queries per second you're executing, and there's no historical information to indicate whether what you're seeing now is an anomaly or business as usual.
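To put the raw counter in context, one option is to sample it twice and look at retries per second rather than the cumulative total (the counter only ever grows from node start). A minimal sketch, assuming the `Speculative retries:` line format from `nodetool tablestats`; the keyspace name and the helper function names here are placeholders, not anything from the thread:

```python
import re
import subprocess
import time


def total_speculative_retries(tablestats_output: str) -> int:
    """Sum every 'Speculative retries' counter in nodetool tablestats output."""
    return sum(int(n) for n in
               re.findall(r"Speculative retries:\s*(\d+)", tablestats_output))


def retry_rate(sample_fn, interval_secs: float = 10.0) -> float:
    """Sample the cumulative counter twice and return retries/second."""
    before = total_speculative_retries(sample_fn())
    time.sleep(interval_secs)
    after = total_speculative_retries(sample_fn())
    return (after - before) / interval_secs


def nodetool_sample() -> str:
    # Hypothetical invocation -- substitute your credentials, port, and keyspace.
    return subprocess.run(
        ["nodetool", "tablestats", "my_keyspace"],
        capture_output=True, text=True, check=True,
    ).stdout
```

A steadily high rate under a steady query load is more informative than a rising absolute number, but as noted above it still doesn't tell you whether the retries are a cause or a symptom.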
If you want to test your theory that speculative retries are causing your performance issue, you could try changing speculative retry to a fixed value instead of a percentile, such as 50ms. It's easy enough to try, and you can get an answer to your question almost immediately.

The problem with this is that you're essentially guessing based on very limited information - the output of a nodetool command you've run "every few secs". I prefer to use a more data-driven approach. Get a CPU flame graph and figure out where your time is spent: https://rustyrazorblade.com/post/2023/2023-11-07-async-profiler/ The flame graph will reveal where your time is going, and you can focus on improving that, rather than looking at a statistic you've picked more or less at random.

I just gave a talk at SCALE on distributed systems performance troubleshooting. You'll be better off following a methodical process than guessing at potential root causes, because the odds of correctly guessing the root cause in a system this complex are close to zero. My talk is here: https://www.youtube.com/watch?v=VX9tHk3VTLE

I'm guessing you don't have dashboards in place if you're relying on nodetool output with grep. If your cluster is under 6 nodes, you can take advantage of AxonOps's free tier: https://axonops.com/ Good dashboards are essential for these types of problems.

Jon

On Sat, Mar 30, 2024 at 2:33 AM ranju goel <goel.ra...@gmail.com> wrote:

> Hi All,
>
> While debugging the cluster for the performance dip seen while using
> 4.1.4, I found a high Speculative retries value in nodetool tablestats
> during read operations.
>
> I ran the tablestats command below, checked its output every few secs,
> and noticed that the retries are on the rise. There is also an open
> ticket (https://issues.apache.org/jira/browse/CASSANDRA-18766) similar
> to this.
> /usr/share/cassandra/bin/nodetool -u <username> -pw <pwd> -p <port> tablestats <keyspace> | grep -i 'Speculative retries'
>
> Speculative retries: 11633
> ..
> ..
> Speculative retries: 13727
> Speculative retries: 14256
> Speculative retries: 14855
> Speculative retries: 14858
> Speculative retries: 14859
> Speculative retries: 14873
> Speculative retries: 14875
> Speculative retries: 14890
> Speculative retries: 14893
> Speculative retries: 14896
> Speculative retries: 14901
> Speculative retries: 14905
> Speculative retries: 14946
> Speculative retries: 14948
> Speculative retries: 14957
>
> Suspecting this could be the cause of the performance dip. Please add
> to this in case anyone knows more about it.
>
> Regards
>
> On Wed, Mar 27, 2024 at 10:43 PM Subroto Barua via user <user@cassandra.apache.org> wrote:
>
>> We are seeing similar perf issues with counter writes - to reproduce:
>>
>> cassandra-stress counter_write n=100000 no-warmup cl=LOCAL_QUORUM -rate threads=50 -mode native cql3 user=<user> password=<pw> -name <cluster_name>
>>
>> op rate: 39,260 ops (4.1) and 63,689 ops (4.0)
>> latency 99th percentile: 7.7ms (4.1) and 1.8ms (4.0)
>> Total GC count: 750 (4.1) and 744 (4.0)
>> Avg GC time: 106 ms (4.1) and 110.1 ms (4.0)
>>
>> On Wednesday, March 27, 2024 at 12:18:50 AM PDT, ranju goel <goel.ra...@gmail.com> wrote:
>>
>> Hi All,
>>
>> Was going through this mail chain
>> (https://www.mail-archive.com/user@cassandra.apache.org/msg63564.html)
>> and was wondering whether this could cause a performance degradation in
>> 4.1 without changing compactionThroughput.
>>
>> We are seeing a performance dip in Read/Write after upgrading from 4.0
>> to 4.1.
>>
>> Regards
>> Ranju
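P.S. For anyone trying the fixed-value experiment described above: speculative_retry is a per-table option set via ALTER TABLE. A sketch, with placeholder keyspace/table names; '99p' is the out-of-the-box percentile default in 4.x, so note your table's current value before changing it:

```
-- Placeholder names; adjust to your schema.
-- Pin speculative retry to a fixed threshold instead of a percentile:
ALTER TABLE my_keyspace.my_table WITH speculative_retry = '50ms';

-- Revert once the experiment is done:
ALTER TABLE my_keyspace.my_table WITH speculative_retry = '99p';
```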