Based on the information I have, I'd say that you don't have any specific issue with the authentication-related tables, but you do have a general overload problem during peak load. I think it's fairly likely that your 7-node cluster (6 nodes in one DC) is not able to keep up with the peak load, and you will need to either scale up for the peak load or tune the application to avoid the bursty behaviour.

On 27/07/2021 16:27, Chahat Bhatia wrote:
Yes, the application is quite read-heavy and the request pattern is bursty too. Hence such a large number of request failures in so little time.
Also, nothing out of the ordinary in cfstats and proxyhistograms.

But there are Native-Transport-Requests dropped messages (similar stats on all the nodes):

 /opt/evive/pkgs/apache-cassandra-3.0.13/bin/nodetool tpstats

    Pool Name                   Active   Pending   Completed   Blocked   All time blocked
    ...
    Native-Transport-Requests        6         0    27777476         0               8262

    Message type           Dropped
    READ                     24260


*We have the following values for these params:*
concurrent_reads: 96
native_transport_max_threads: 320
JVM_OPTS="$JVM_OPTS -Dcassandra.max_queued_native_transport_requests=3072"

Do you think it would be advisable to tune the numbers in the above params to reduce the load on the nodes?
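As a rough, hypothetical back-of-envelope (the overflow rate below is inferred from the "16036 dropped in 5000 ms" log line and may not reflect the sustained peak), the current queue depth only buys about a second of buffering against a burst of that size:

```shell
# Hypothetical figures: 3072 is cassandra.max_queued_native_transport_requests
# from the JVM_OPTS above; the overflow rate is the 16036 READ drops
# observed over one 5000 ms window on a single node.
awk 'BEGIN {
    queue = 3072                 # max queued native transport requests
    overflow_rps = 16036 / 5     # dropped requests per second during the burst
    printf "queue absorbs ~%.2f s of overflow\n", queue / overflow_rps
}'
```

If the bursts last longer than that, raising the queue depth alone would only delay the drops rather than prevent them.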


On Tue, 27 Jul 2021 at 20:13, Bowen Song <bo...@bso.ng> wrote:

    Wow, 15 seconds timeout? That's pretty long... You may want to
    check the nodetool tpstats output and make sure the
    Native-Transport-Requests thread pool isn't blocking things.

    16k read requests dropped in 5 seconds, or over 3k requests per
    second on a single node, is a bit suspicious. Do your read
    requests tend to be bursty?

    On 27/07/2021 15:32, Chahat Bhatia wrote:
    Yes, RF=6 for system auth. Sorry my bad.


    No, we are not using cassandra user for the application. We have
    a custom super user for our operational and administrative tasks
    and a separate role with needed perms for the application.

                   role | super | login | options
        ----------------+-------+-------+---------
         EviveCassandra |  True |  True |      {}
                  ccusr | False |  True |      {}


      *  gc_warn_threshold_in_ms: 1000
      * read_request_timeout_in_ms: 15000  // been the same since the
        start
      * cross_node_timeout: false           // but we have ntp on
        each server and the clocks are synced.

      * On checking the logs a little more I found out these:

            INFO  [ScheduledTasks:1] 2021-07-26 07:21:33,279
            MessagingService.java:1013 - READ messages were dropped
            in last 5000 ms: 289 for internal timeout and 0 for cross
            node timeout

            INFO  [ScheduledTasks:1] 2021-07-26 07:22:33,894
            MessagingService.java:1013 - READ messages were dropped
            in last 5000 ms: 273 for internal timeout and 0 for cross
            node timeout

            INFO  [ScheduledTasks:1] 2021-07-27 06:12:03,988
            MessagingService.java:1013 - READ messages were dropped
            in last 5000 ms: 160 for internal timeout and 0 for cross
            node timeout

            INFO  [ScheduledTasks:1] 2021-07-27 06:13:09,125
            MessagingService.java:1013 - READ messages were dropped
            in last 5000 ms: 223 for internal timeout and 0 for cross
            node timeout

            INFO  [ScheduledTasks:1] 2021-07-27 06:13:44,212
            MessagingService.java:1013 - READ messages were dropped
            in last 5000 ms: 72 for internal timeout and 0 for cross
            node timeout

            INFO  [ScheduledTasks:1] 2021-07-27 06:14:14,496
            MessagingService.java:1013 - READ messages were dropped
            in last 5000 ms: 16036 for internal timeout and 0 for
            cross node timeout

            INFO  [ScheduledTasks:1] 2021-07-27 06:15:59,744
            MessagingService.java:1013 - READ messages were dropped
            in last 5000 ms: 287 for internal timeout and 0 for cross
            node timeout


      * Also, checking cfstats and proxyhistograms is still in
        progress; will update in case anything's suspicious.
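Incidentally, totals from dropped-message log lines like the ones above can be summed with a quick awk pass. A sketch, assuming the standard single-line MessagingService format (the heredoc below just embeds two sample lines; point the script at system.log in practice):

```shell
# Sum the internal-timeout READ drops reported by MessagingService.
# The pattern "<N> for internal" picks out the internal-timeout count
# and ignores the cross-node-timeout count on the same line.
awk '/READ messages were dropped/ {
    for (i = 1; i <= NF; i++)
        if ($(i+1) == "for" && $(i+2) == "internal")
            total += $i
}
END { printf "total READ drops (internal timeout): %d\n", total }' <<'EOF'
INFO  [ScheduledTasks:1] 2021-07-27 06:13:44,212 MessagingService.java:1013 - READ messages were dropped in last 5000 ms: 72 for internal timeout and 0 for cross node timeout
INFO  [ScheduledTasks:1] 2021-07-27 06:14:14,496 MessagingService.java:1013 - READ messages were dropped in last 5000 ms: 16036 for internal timeout and 0 for cross node timeout
EOF
```

This makes it easy to see how much of the total comes from the single 06:14 burst.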


    On Tue, 27 Jul 2021 at 18:09, Bowen Song <bo...@bso.ng> wrote:

        Hello Chahat,


        You haven't replied to the first point: are you using the
        "cassandra" user?


        The schema and your description don't quite match. When you said:

            the system_auth for 2 DCs: *us-east* with 6 nodes (and
            RF=3) and ...

        I assume you meant to say 6 nodes and RF=6? That sounds
        totally fine.


        As for the GC logs, do you see any WARN-level entries? You may
        also want to double-check that the gc_warn_threshold_in_ms
        value you chose is reasonable. I'm a bit surprised to see the
        GC logs full of ~1.6ms GC pauses. AFAIK, by default Cassandra
        only logs GC pauses longer than 200ms, so I have to assume
        that's the result of your custom script. You may want to
        double-check that the script actually does what it says on the
        tin, just in case it misses something.


        Then, the other relevant questions.

        Do you use cross node timeout? If so, are the clocks in sync
        between all servers?

        What are the timeout values in the cassandra.yaml file
        (read_request_timeout_in_ms, etc.)? Are they long enough for
        the corresponding request type to complete?

        Since you've only got 7 nodes, I'd also recommend checking the
        nodetool cfstats & nodetool cfhistograms output for the tables
        in the system_auth keyspace on each node, and seeing if you
        can spot anything out of the ordinary.


        Regards,

        Bowen

        On 27/07/2021 09:23, Chahat Bhatia wrote:
        Thanks for the prompt response.
        *Here is the system_schema.keyspaces entry:*

                system_auth | True | {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'us-east': '6', 'us-east-backup': '1'}
                     census | True | {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'us-east': '3', 'us-east-backup': '1'}


        So, the system_auth for 2 DCs: *us-east* with 6 nodes (and
        RF=3) and *us-east-backup* with a single node and 100% of the
        data.
        *Keyspace census* shows RF=3 for the main DC and RF=1 for the
        backup site.


            And for running the application, we have a user created
            specifically for that application, and that user has the
            appropriate permissions in Cassandra to select, modify and
            delete from the concerned tables. It uses LOCAL_QUORUM for
            querying the data, and the local DC is set to 'us-east'.


        Also, there is no excessive GC on any of the nodes; we run a
        custom script to track the GC stats (from the Cassandra log
        itself) and output them. Below is the output from the
        currently running script, and it's similar for all the
        servers:

                2021-07-27 03:04:45,072 INFO gcstats:58 -
                Application Thread stop time 0.001656 seconds.
                2021-07-27 03:04:45,080 INFO gcstats:58 -
                Application Thread stop time 0.001669 seconds.
                2021-07-27 03:04:45,087 INFO gcstats:58 -
                Application Thread stop time 0.001601 seconds.
                2021-07-27 03:04:45,095 INFO gcstats:58 -
                Application Thread stop time 0.001713 seconds.
                2021-07-27 03:04:45,103 INFO gcstats:58 -
                Application Thread stop time 0.001586 seconds.
                2021-07-27 03:04:45,110 INFO gcstats:58 -
                Application Thread stop time 0.001671 seconds.
                2021-07-27 03:04:45,118 INFO gcstats:58 -
                Application Thread stop time 0.001691 seconds.
                2021-07-27 03:04:45,127 INFO gcstats:58 -
                Application Thread stop time 0.001860 seconds.
                2021-07-27 03:04:45,134 INFO gcstats:58 -
                Application Thread stop time 0.001630 seconds.
                2021-07-27 03:04:45,141 INFO gcstats:58 -
                Application Thread stop time 0.001515 seconds.
                2021-07-27 03:04:45,148 INFO gcstats:58 -
                Application Thread stop time 0.001533 seconds.
                2021-07-27 03:04:45,156 INFO gcstats:58 -
                Application Thread stop time 0.001630 seconds.
                2021-07-27 03:04:45,163 INFO gcstats:58 -
                Application Thread stop time 0.001577 seconds.
                2021-07-27 03:04:45,170 INFO gcstats:58 -
                Application Thread stop time 0.001538 seconds.
                2021-07-27 03:04:45,177 INFO gcstats:58 -
                Application Thread stop time 0.001615 seconds.
                2021-07-27 03:04:45,186 INFO gcstats:58 -
                Application Thread stop time 0.001584 seconds.
                2021-07-27 03:04:45,193 INFO gcstats:58 -
                Application Thread stop time 0.001558 seconds.
                2021-07-27 03:04:45,200 INFO gcstats:58 -
                Application Thread stop time 0.001696 seconds.
                2021-07-27 03:04:45,208 INFO gcstats:58 -
                Application Thread stop time 0.001658 seconds.
                2021-07-27 03:04:45,215 INFO gcstats:58 -
                Application Thread stop time 0.001592 seconds.
                2021-07-27 03:04:45,222 INFO gcstats:58 -
                Application Thread stop time 0.001618 seconds.
                2021-07-27 03:05:08,907 INFO gcstats:58 -
                Application Thread stop time 0.001624 seconds.
                2021-07-27 03:06:34,436 INFO gcstats:58 -
                Application Thread stop time 0.297773 seconds.






        On Tue, 27 Jul 2021 at 13:23, 'Bowen Song' via Infra Updates
        <infra-upda...@goevive.com> wrote:

            Hello Chahat,


            First, can you please make sure the Cassandra user used
            by the application is not "cassandra"? Because the
            "cassandra" user uses QUORUM consistency level to read
            the auth tables.

            Then, can you please make sure the replication strategy is
            set correctly for the system_auth keyspace? I.e.: ensure
            the old DC is not present, and the new DC has a sufficient
            number of replicas for fault tolerance.
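For example, the current settings can be checked and, if needed, corrected along these lines (a sketch only; the DC names and replica counts are illustrative and should match your actual topology, and any change to system_auth replication should be followed by a repair of that keyspace):

```
SELECT keyspace_name, replication
  FROM system_schema.keyspaces
 WHERE keyspace_name = 'system_auth';

ALTER KEYSPACE system_auth WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 6,
    'us-east-backup': 1
};
```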

            Finally, can you please check the GC logs, and make sure
            there aren't any JVM GC issues, especially long STW
            pauses?


            Regards,

            Bowen


            On 27/07/2021 08:34, Chahat Bhatia wrote:
            Hi Community,

            Context: We are running a cluster of 6 nodes in
            production with a RF=3 in AWS.
            We recently moved from physical servers to cloud by
            adding a new DC and then removing the old one.
            Everything is working fine in all the other
            applications except this one.

            *We recently started experiencing read timeouts in one of
            our production applications, where the client threw:*

                    Error An unexpected error occurred server side on ip-IP.ec2.internal:
                    com.google.common.util.concurrent.UncheckedExecutionException: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
                    com.datastax.driver.core.exceptions.ServerError: An unexpected error occurred server side: com.google.common.util.concurrent.UncheckedExecutionException: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
                        at com.datastax.driver.core.exceptions.ServerError.copy(ServerError.java:63) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?]
                        at com.datastax.driver.core.exceptions.ServerError.copy(ServerError.java:25) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?]
                        at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?]
                        at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?]
                        at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:68) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?]
                    ............ cntd

                    com.datastax.driver.core.exceptions.ReadTimeoutException:
                    Cassandra timeout during read query at
                    consistency LOCAL_QUORUM (2 responses were
                    required but only 1 replica responded)



            *And around the same time these were the errors on the
            server side (from the Cassandra logs):*

                    *ERROR [RolesCacheRefresh:1] 2021-07-26 06:32:43,094 CassandraDaemon.java:207 - Exception in thread Thread[RolesCacheRefresh:1,5,main]
                    java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.*
                        at org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:512) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.auth.CassandraRoleManager.getRoles(CassandraRoleManager.java:280) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.auth.RolesCache$1$1.call(RolesCache.java:135) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.auth.RolesCache$1$1.call(RolesCache.java:130) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_131]
                        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_131]
                        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
                        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) [apache-cassandra-3.0.13.jar:3.0.13]
                        at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131]
                    Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
                        at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1715) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1664) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1605) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1524) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:955) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:263) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:224) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.auth.CassandraRoleManager.getRoleFromTable(CassandraRoleManager.java:520) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:502) ~[apache-cassandra-3.0.13.jar:3.0.13]

                    *ERROR [PermissionsCacheRefresh:1] 2021-07-26 07:11:25,804 CassandraDaemon.java:207 - Exception in thread Thread[PermissionsCacheRefresh:1,5,main]
                    java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.*
                        at org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:512) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.auth.CassandraRoleManager.isSuper(CassandraRoleManager.java:304) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.auth.Roles.hasSuperuserStatus(Roles.java:52) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:71) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:76) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.auth.PermissionsCache$1$1.call(PermissionsCache.java:136) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.auth.PermissionsCache$1$1.call(PermissionsCache.java:131) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_131]
                        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_131]
                        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
                        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) [apache-cassandra-3.0.13.jar:3.0.13]
                        at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131]
                    Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
                        at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1715) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1664) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1605) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1524) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:955) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:263) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:224) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.auth.CassandraRoleManager.getRoleFromTable(CassandraRoleManager.java:520) ~[apache-cassandra-3.0.13.jar:3.0.13]
                        at org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:502) ~[apache-cassandra-3.0.13.jar:3.0.13]



            *These are the values of these params in my configuration
            file:*

                    permissions_validity_in_ms: 300000
                    permissions_update_interval_in_ms: 20000
                    roles_validity_in_ms: 300000
                    roles_update_interval_in_ms: 20000


            This was not the case earlier, and since this comes from a
            single app alone, we are not sure if this is actually the
            issue. Can anyone please point out whether these values
            are misconfigured and hence causing the issue, or whether
            we should be looking somewhere else?

            Any help would be appreciated.

            Thanks & Regards,
            Chahat.



--
        Thanks & Regards,
        Chahat Bhatia
        Systems Engineer
        *Evive*
        +91 7087629779






