Yes, RF=6 for system_auth. Sorry, my bad.
No, we are not using the cassandra user for the application. We have a
custom superuser for our operational and administrative tasks and a
separate role with the needed permissions for the application.
 role           | super | login | options
----------------+-------+-------+---------
 EviveCassandra |  True |  True |      {}
 ccusr          | False |  True |      {}
* gc_warn_threshold_in_ms: 1000
* read_request_timeout_in_ms: 15000 // has been the same since the start
* cross_node_timeout: false // but we have NTP on each server and the clocks are synced
* On checking the logs a little more, I found these:
INFO [ScheduledTasks:1] 2021-07-26 07:21:33,279 MessagingService.java:1013 - READ messages were dropped in last 5000 ms: 289 for internal timeout and 0 for cross node timeout
INFO [ScheduledTasks:1] 2021-07-26 07:22:33,894 MessagingService.java:1013 - READ messages were dropped in last 5000 ms: 273 for internal timeout and 0 for cross node timeout
INFO [ScheduledTasks:1] 2021-07-27 06:12:03,988 MessagingService.java:1013 - READ messages were dropped in last 5000 ms: 160 for internal timeout and 0 for cross node timeout
INFO [ScheduledTasks:1] 2021-07-27 06:13:09,125 MessagingService.java:1013 - READ messages were dropped in last 5000 ms: 223 for internal timeout and 0 for cross node timeout
INFO [ScheduledTasks:1] 2021-07-27 06:13:44,212 MessagingService.java:1013 - READ messages were dropped in last 5000 ms: 72 for internal timeout and 0 for cross node timeout
INFO [ScheduledTasks:1] 2021-07-27 06:14:14,496 MessagingService.java:1013 - READ messages were dropped in last 5000 ms: 16036 for internal timeout and 0 for cross node timeout
INFO [ScheduledTasks:1] 2021-07-27 06:15:59,744 MessagingService.java:1013 - READ messages were dropped in last 5000 ms: 287 for internal timeout and 0 for cross node timeout
* Also, I am still checking nodetool cfstats and proxyhistograms; I will update if anything looks suspicious.
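For what it's worth, the dropped-READ reports above can be aggregated with a short script (a minimal sketch; the regex assumes the stock MessagingService log format shown above, and the sample below is just three of the lines from this mail):

```python
import re

# Sample lines copied from the log excerpt above.
log = """\
INFO [ScheduledTasks:1] 2021-07-26 07:21:33,279 MessagingService.java:1013 - READ messages were dropped in last 5000 ms: 289 for internal timeout and 0 for cross node timeout
INFO [ScheduledTasks:1] 2021-07-26 07:22:33,894 MessagingService.java:1013 - READ messages were dropped in last 5000 ms: 273 for internal timeout and 0 for cross node timeout
INFO [ScheduledTasks:1] 2021-07-27 06:14:14,496 MessagingService.java:1013 - READ messages were dropped in last 5000 ms: 16036 for internal timeout and 0 for cross node timeout
"""

# Capture the internal-timeout drop count from each report.
pattern = re.compile(
    r"READ messages were dropped in last \d+ ms: (\d+) for internal timeout"
)
internal = [int(m.group(1)) for m in pattern.finditer(log)]

print("reports:", len(internal))
print("total internal drops:", sum(internal))
print("worst 5s window:", max(internal))  # the 16036 spike stands out
```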
On Tue, 27 Jul 2021 at 18:09, Bowen Song <bo...@bso.ng> wrote:
Hello Chahat,
You haven't replied to the first point: are you using the
"cassandra" user?
The schema and your description don't quite match. When you said:

    the system_auth for 2 DCs : us-east with 6 nodes (and RF=3) and ...

I assume you meant to say 6 nodes and RF=6? That sounds totally fine.
As for the GC logs, do you see any WARN level logs? You may also
want to double-check that the chosen gc_warn_threshold_in_ms
value is reasonable. I'm a bit surprised to see the GC logs full
of ~1.6ms GC pauses. AFAIK, by default Cassandra only logs GC
pauses longer than 200ms, so I have to assume that's the result of
your custom script. You may want to double-check that the script
actually does what it says on the tin, just in case it misses
something.
Then, the other relevant questions:
Do you use cross node timeout? If so, are the clocks in sync
between all servers?
What are the timeout values in the cassandra.yaml file
(read_request_timeout_in_ms, etc.)? Are they reasonably long enough
for the corresponding request type to complete?
Since you've only got 7 nodes, I'd also recommend you check the
nodetool cfstats and nodetool cfhistograms output for the tables in
the system_auth keyspace on each node, and see if you can spot
anything out of the ordinary.
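For example, something like this could generate the per-table commands to run on each node (a sketch only; the system_auth table names below are what I'd expect on a 3.0 cluster, so please double-check them with DESCRIBE KEYSPACE system_auth first):

```python
# Assumed system_auth tables on Cassandra 3.0 - verify with
# "DESCRIBE KEYSPACE system_auth" before relying on this list.
tables = [
    "roles",
    "role_members",
    "role_permissions",
    "resource_role_permissons_index",  # sic - the misspelling is in Cassandra itself
]

# cfstats takes a keyspace; cfhistograms takes keyspace and table.
commands = ["nodetool cfstats system_auth"]
commands += [f"nodetool cfhistograms system_auth {t}" for t in tables]

for cmd in commands:
    print(cmd)
```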
Regards,
Bowen
On 27/07/2021 09:23, Chahat Bhatia wrote:
Thanks for the prompt response.
Here are the system_schema.keyspaces entries:
system_auth | True | {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'us-east': '6', 'us-east-backup': '1'}
census      | True | {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'us-east': '3', 'us-east-backup': '1'}
So, the system_auth for 2 DCs: us-east with 6 nodes (and
RF=3) and us-east-backup with a single node and 100% of the data.
Keyspace census shows RF=3 for the main DC and RF=1 for the
backup site.
And for running the application, we have a user created
specifically for that application, and that user has the
appropriate permissions in Cassandra to select, modify
and delete from the concerned tables. It uses LOCAL_QUORUM
for querying the data, and the local DC is set to 'us-east'.
Also, there is no excessive GC on any of the nodes; we run a
custom script to track the GC stats (from the Cassandra log
itself) and output them. Below is the output from the currently
running script, and it's similar for all the servers:
2021-07-27 03:04:45,072 INFO gcstats:58 - Application Thread stop time 0.001656 seconds.
2021-07-27 03:04:45,080 INFO gcstats:58 - Application Thread stop time 0.001669 seconds.
2021-07-27 03:04:45,087 INFO gcstats:58 - Application Thread stop time 0.001601 seconds.
2021-07-27 03:04:45,095 INFO gcstats:58 - Application Thread stop time 0.001713 seconds.
2021-07-27 03:04:45,103 INFO gcstats:58 - Application Thread stop time 0.001586 seconds.
2021-07-27 03:04:45,110 INFO gcstats:58 - Application Thread stop time 0.001671 seconds.
2021-07-27 03:04:45,118 INFO gcstats:58 - Application Thread stop time 0.001691 seconds.
2021-07-27 03:04:45,127 INFO gcstats:58 - Application Thread stop time 0.001860 seconds.
2021-07-27 03:04:45,134 INFO gcstats:58 - Application Thread stop time 0.001630 seconds.
2021-07-27 03:04:45,141 INFO gcstats:58 - Application Thread stop time 0.001515 seconds.
2021-07-27 03:04:45,148 INFO gcstats:58 - Application Thread stop time 0.001533 seconds.
2021-07-27 03:04:45,156 INFO gcstats:58 - Application Thread stop time 0.001630 seconds.
2021-07-27 03:04:45,163 INFO gcstats:58 - Application Thread stop time 0.001577 seconds.
2021-07-27 03:04:45,170 INFO gcstats:58 - Application Thread stop time 0.001538 seconds.
2021-07-27 03:04:45,177 INFO gcstats:58 - Application Thread stop time 0.001615 seconds.
2021-07-27 03:04:45,186 INFO gcstats:58 - Application Thread stop time 0.001584 seconds.
2021-07-27 03:04:45,193 INFO gcstats:58 - Application Thread stop time 0.001558 seconds.
2021-07-27 03:04:45,200 INFO gcstats:58 - Application Thread stop time 0.001696 seconds.
2021-07-27 03:04:45,208 INFO gcstats:58 - Application Thread stop time 0.001658 seconds.
2021-07-27 03:04:45,215 INFO gcstats:58 - Application Thread stop time 0.001592 seconds.
2021-07-27 03:04:45,222 INFO gcstats:58 - Application Thread stop time 0.001618 seconds.
2021-07-27 03:05:08,907 INFO gcstats:58 - Application Thread stop time 0.001624 seconds.
2021-07-27 03:06:34,436 INFO gcstats:58 - Application Thread stop time 0.297773 seconds.
On Tue, 27 Jul 2021 at 13:23, 'Bowen Song' via Infra Updates
<infra-upda...@goevive.com> wrote:
Hello Chahat,
First, can you please make sure the Cassandra user used by
the application is not "cassandra"? The "cassandra" user uses
QUORUM consistency level to read the auth tables.
Then, can you please make sure the replication strategy is
set correctly for the system_auth keyspace? I.e. ensure the
old DC is not present, and the new DC has a sufficient number
of replicas for fault tolerance.
Finally, can you please check the GC logs, and make sure
there aren't any JVM GC issues, especially long STW pauses.
Regards,
Bowen
On 27/07/2021 08:34, Chahat Bhatia wrote:
Hi Community,
Context: we are running a cluster of 6 nodes in production
with RF=3 in AWS.
We recently moved from physical servers to cloud by adding a
new DC and then removing the old one. Everything is working
fine in all the other applications except this one.
We recently started experiencing read timeouts in one of
our production applications, where the client threw:

Error: An unexpected error occurred server side on ip-IP.ec2.internal: com.google.common.util.concurrent.UncheckedExecutionException: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
com.datastax.driver.core.exceptions.ServerError: An unexpected error occurred server side: com.google.common.util.concurrent.UncheckedExecutionException: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
    at com.datastax.driver.core.exceptions.ServerError.copy(ServerError.java:63) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?]
    at com.datastax.driver.core.exceptions.ServerError.copy(ServerError.java:25) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?]
    at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?]
    at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?]
    at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:68) ~[cassandra-driver-core-3.3.0-shaded.jar!/:?]
............ cntd
com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 1 replica responded)
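For reference, that "2 responses were required" figure matches the usual quorum arithmetic: (LOCAL_)QUORUM needs floor(RF_local / 2) + 1 replicas from the local DC. A quick sanity check (my own arithmetic, not from the logs):

```python
def local_quorum(rf_local: int) -> int:
    """Replicas that must respond for (LOCAL_)QUORUM at a given local RF."""
    return rf_local // 2 + 1

# census in us-east has RF=3 -> 2 responses required, matching the error above.
print(local_quorum(3))  # 2
# system_auth in us-east has RF=6 -> 4 responses required.
print(local_quorum(6))  # 4
# us-east-backup has RF=1, so a quorum there is a single node.
print(local_quorum(1))  # 1
```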
And around the same time, these were the errors on the
server side (from the Cassandra logs):
ERROR [RolesCacheRefresh:1] 2021-07-26 06:32:43,094 CassandraDaemon.java:207 - Exception in thread Thread[RolesCacheRefresh:1,5,main]
java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
    at org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:512) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.auth.CassandraRoleManager.getRoles(CassandraRoleManager.java:280) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.auth.RolesCache$1$1.call(RolesCache.java:135) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.auth.RolesCache$1$1.call(RolesCache.java:130) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
    at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) [apache-cassandra-3.0.13.jar:3.0.13]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131]
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
    at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1715) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1664) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1605) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1524) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:955) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:263) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:224) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.auth.CassandraRoleManager.getRoleFromTable(CassandraRoleManager.java:520) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:502) ~[apache-cassandra-3.0.13.jar:3.0.13]
ERROR [PermissionsCacheRefresh:1] 2021-07-26 07:11:25,804 CassandraDaemon.java:207 - Exception in thread Thread[PermissionsCacheRefresh:1,5,main]
java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
    at org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:512) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.auth.CassandraRoleManager.isSuper(CassandraRoleManager.java:304) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.auth.Roles.hasSuperuserStatus(Roles.java:52) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:71) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:76) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.auth.PermissionsCache$1$1.call(PermissionsCache.java:136) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.auth.PermissionsCache$1$1.call(PermissionsCache.java:131) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
    at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) [apache-cassandra-3.0.13.jar:3.0.13]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_131]
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
    at org.apache.cassandra.service.ReadCallback.awaitResults(ReadCallback.java:132) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:137) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:145) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.service.StorageProxy$SinglePartitionReadLifecycle.awaitResultsAndRetryOnDigestMismatch(StorageProxy.java:1715) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1664) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1605) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1524) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.db.SinglePartitionReadCommand$Group.execute(SinglePartitionReadCommand.java:955) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:263) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:224) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.auth.CassandraRoleManager.getRoleFromTable(CassandraRoleManager.java:520) ~[apache-cassandra-3.0.13.jar:3.0.13]
    at org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:502) ~[apache-cassandra-3.0.13.jar:3.0.13]
These are the values of these params in my configuration file:
permissions_validity_in_ms: 300000
permissions_update_interval_in_ms: 20000
roles_validity_in_ms: 300000
roles_update_interval_in_ms: 20000
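As I read those values (my own interpretation of the cache settings, so please correct me if wrong): a cached role/permission entry is considered valid for 5 minutes, while a background refresh is attempted every 20 seconds; the RolesCacheRefresh/PermissionsCacheRefresh threads in the stack traces above are those background refreshes timing out against system_auth.

```python
# The values from cassandra.yaml above, converted for readability.
permissions_validity_in_ms = 300_000
permissions_update_interval_in_ms = 20_000

print(f"validity: {permissions_validity_in_ms // 1000}s")              # 5 minutes
print(f"refresh interval: {permissions_update_interval_in_ms // 1000}s")
# Rough number of background refresh attempts per validity window:
print(permissions_validity_in_ms // permissions_update_interval_in_ms)  # 15
```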
This was not the case earlier, and since it comes from a
single app alone, we are not sure if this is actually the
issue. Can anyone please point out whether these values are
misconfigured and hence causing the issue, or should we be
looking somewhere else?
Any help would be appreciated.
Thanks & Regards,
Chahat.
--
Thanks & Regards,
Chahat Bhatia
Systems Engineer
*Evive*
+91 7087629779