Try changing the chunk length parameter in the compression settings to 4 KB,
and reduce read ahead to 16 KB if you’re using EBS, or 4 KB if you’re using a
decent local SSD or NVMe drive.
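
A rough sketch of both changes (the keyspace/table and device names are
placeholders, so adjust them to your own schema and disks):

cqlsh -e "ALTER TABLE my_ks.my_table WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 4};"

# read ahead is set in 512-byte sectors: 32 sectors = 16 KB, 8 sectors = 4 KB
sudo blockdev --setra 32 /dev/xvdf      # EBS volume
sudo blockdev --setra 8 /dev/nvme0n1    # local SSD / NVMe

Note that blockdev settings don't persist across reboots, so you'd also want to
set them via udev rules or your config management.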

Counters do a read before write, so counter write performance is tied to read
performance.

—
Jon Haddad
Rustyrazorblade Consulting
rustyrazorblade.com


On Fri, Apr 5, 2024 at 9:27 AM Subroto Barua <sbarua...@yahoo.com> wrote:

> follow up question on the performance issue with 'counter writes': is there a
> parameter or condition that limits the allocation rate for
> 'CounterMutationStage'? I see 13-18 MB/s for 4.1.4 vs 20-25 MB/s for 4.0.5.
>
> The back-end infra is the same for both clusters, with the same test cases and
> data model.
> On Saturday, March 30, 2024 at 08:40:28 AM PDT, Jon Haddad <
> j...@jonhaddad.com> wrote:
>
>
> Hi,
>
> Unfortunately, the numbers you're posting have no meaning without
> context.  The speculative retries could be the cause of a problem, or you
> could simply be executing enough queries, with enough variance in latency,
> that they're triggered often.  It's unclear how many queries per second
> you're executing, and there's no historical information to suggest whether
> what you're seeing now is an anomaly or business as usual.
>
> If you want to test your theory that speculative retries are causing your
> performance issue, you could try changing speculative retry to a fixed value
> instead of a percentile, such as 50ms.  It's easy enough to try, and you can
> get an answer to your question almost immediately.
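>
> For instance, a sketch with a placeholder keyspace/table name:
>
> cqlsh -e "ALTER TABLE my_ks.my_table WITH speculative_retry = '50ms';"
>
> Once you have your answer, you can set it back to the percentile default
> ('99p') the same way.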
>
> The problem with this is that you're essentially guessing based on very
> limited information - the output of a nodetool command you've run "every
> few secs".  I prefer to use a more data driven approach.  Get a CPU flame
> graph and figure out where your time is spent:
> https://rustyrazorblade.com/post/2023/2023-11-07-async-profiler/
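>
> As a rough sketch, assuming async-profiler is already downloaded on the node
> (the post above walks through the details), something like:
>
> ./profiler.sh -e cpu -d 60 -f /tmp/flamegraph.html $(pgrep -f CassandraDaemon)
>
> will capture 60 seconds of CPU samples and write an interactive flame graph
> you can open in a browser.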
>
> The flame graph will reveal where your time is spent, and you can focus on
> improving that, rather than looking at a random statistic that you've
> picked.
>
> I just gave a talk at SCALE on distributed systems performance
> troubleshooting.  You'll be better off following a methodical process than
> guessing at potential root causes, because the odds of correctly guessing
> the root cause in a system this complex are close to zero.  My talk
> is here: https://www.youtube.com/watch?v=VX9tHk3VTLE
>
> I'm guessing you don't have dashboards in place if you're relying on
> nodetool output with grep.  If your cluster is under 6 nodes, you can take
> advantage of AxonOps's free tier: https://axonops.com/
>
> Good dashboards are essential for these types of problems.
>
> Jon
>
>
>
> On Sat, Mar 30, 2024 at 2:33 AM ranju goel <goel.ra...@gmail.com> wrote:
>
> Hi All,
>
> While debugging the performance dip seen on 4.1.4, I found a high
> Speculative retries value in nodetool tablestats during read operations.
>
> I ran the tablestats command below, checked its output every few seconds, and
> noticed that the retries keep rising.  There is also an open ticket
> (https://issues.apache.org/jira/browse/CASSANDRA-18766) describing something
> similar.
> /usr/share/cassandra/bin/nodetool -u <username> -pw <pwd> -p <port>
> tablestats <keyspace> | grep -i 'Speculative retries'
>
>
>
>                 Speculative retries: 11633
>                 ..
>                 ..
>                 Speculative retries: 13727
>                 Speculative retries: 14256
>                 Speculative retries: 14855
>                 Speculative retries: 14858
>                 Speculative retries: 14859
>                 Speculative retries: 14873
>                 Speculative retries: 14875
>                 Speculative retries: 14890
>                 Speculative retries: 14893
>                 Speculative retries: 14896
>                 Speculative retries: 14901
>                 Speculative retries: 14905
>                 Speculative retries: 14946
>                 Speculative retries: 14948
>                 Speculative retries: 14957
>
>
> I suspect this could be the cause of the performance dip.  Please chime in if
> anyone knows more about it.
>
>
> Regards
>
> On Wed, Mar 27, 2024 at 10:43 PM Subroto Barua via user <
> user@cassandra.apache.org> wrote:
>
> We are seeing similar perf issues with counter writes. To reproduce:
>
> cassandra-stress counter_write n=100000 no-warmup cl=LOCAL_QUORUM -rate
> threads=50 -mode native cql3 user=<user> password=<pw> -name <cluster_name>
>
>
> op rate: 39,260 ops (4.1) and 63,689 ops (4.0)
> latency 99th percentile: 7.7ms (4.1) and 1.8ms (4.0)
> Total GC count: 750 (4.1) and 744 (4.0)
> Avg GC time: 106 ms (4.1) and 110.1 ms (4.0)
>
>
> On Wednesday, March 27, 2024 at 12:18:50 AM PDT, ranju goel <
> goel.ra...@gmail.com> wrote:
>
>
> Hi All,
>
> I was going through this mail chain
> (https://www.mail-archive.com/user@cassandra.apache.org/msg63564.html)
> and was wondering whether this could cause a performance degradation in
> 4.1 without changing compactionThroughput.
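>
> For reference, the compaction throttle can be checked and adjusted at runtime,
> e.g. (the 64 MB/s below is just an illustrative value, not a recommendation):
>
> nodetool getcompactionthroughput
> nodetool setcompactionthroughput 64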
>
> We are seeing a performance dip in reads/writes after upgrading from 4.0 to 4.1.
>
> Regards
> Ranju
>
>
