Re: Permission/Role Cache causing timeouts in apps.

2021-07-27 Thread Chahat Bhatia
Okay. Sure. Thanks a lot for all the information. Really helped. :)
On Tue, 27 Jul 2021 at 21:05, Bowen Song wrote:
> Based on the information I know, I'd say that you don't have any specific issue with the authentication related tables, but you do have a general overloading problem during

Re: Permission/Role Cache causing timeouts in apps.

2021-07-27 Thread Bowen Song
Based on the information I know, I'd say that you don't have any specific issue with the authentication-related tables, but you do have a general overloading problem during peak load. I think it's fairly likely that your 7-node cluster (6 nodes in one DC) is not able to keep up with the peak

Re: Permission/Role Cache causing timeouts in apps.

2021-07-27 Thread Chahat Bhatia
Yes, the application is quite read heavy and the request pattern is bursty too, hence such a large number of failed requests in such a short time. Also, nothing out of the ordinary in cfstats and proxyhistograms. But there are dropped Native-Transport-Requests messages (almost identical stats on all the nodes):

Re: Permission/Role Cache causing timeouts in apps.

2021-07-27 Thread Bowen Song
Wow, a 15-second timeout? That's pretty long... You may want to check nodetool tpstats and make sure the Native-Transport-Requests thread pool isn't blocking things. 16k read requests dropped in 5 seconds, or over 3k requests per second on a single node, is a bit suspicious. Do your read requests tend to be
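The per-second figure in the message above follows from simple division. A minimal sketch, assuming "16k" is a round 16,000 (the exact count is not given in this snippet); the function name `dropped_per_second` is hypothetical:

```python
def dropped_per_second(dropped_requests, window_seconds):
    """Average number of dropped requests per second over the window."""
    return dropped_requests / window_seconds

# "16k read requests dropped in 5 seconds", with 16k taken as 16,000.
rate = dropped_per_second(16_000, 5)
print(rate)  # 3200.0, i.e. "over 3k requests per second"
```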

Re: Permission/Role Cache causing timeouts in apps.

2021-07-27 Thread Chahat Bhatia
Yes, RF=6 for system_auth. Sorry, my bad. No, we are not using the cassandra user for the application. We have a custom superuser for our operational and administrative tasks and a separate role with the needed permissions for the application.
> role | super | login | options

Re: Permission/Role Cache causing timeouts in apps.

2021-07-27 Thread Bowen Song
Hello Chahat, You haven't replied to the first point: are you using the "cassandra" user? The schema and your description don't quite match. When you said: "the system_auth for 2 DCs: us-east with 6 nodes (and RF=3) and ..." I assume you meant to say 6 nodes and RF=6?

Re: Permission/Role Cache causing timeouts in apps.

2021-07-27 Thread Chahat Bhatia
> Also, it's interesting that you've set validity to over 3 days but you update them every 6 hours. Is that intentional?
We set that earlier when we were in the process of adding new roles (creating new roles for the new apps we set up), but we never changed it after that and hence it's been the same

Re: Permission/Role Cache causing timeouts in apps.

2021-07-27 Thread Chahat Bhatia
Thanks for the prompt response. Here is the system_schema.keyspaces entry:
system_auth | True | {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'us-east': '6', 'us-east-backup': '1'}
census | True | {'class':

Re: Permission/Role Cache causing timeouts in apps.

2021-07-27 Thread Bowen Song
Hello Chahat, First, can you please make sure the Cassandra user used by the application is not "cassandra"? Because the "cassandra" user uses QUORUM consistency level to read the auth tables. Then, can you please make sure the replication strategy is set correctly for the system_auth
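To illustrate why a QUORUM read of the auth tables is costly, the sketch below computes how many replicas such a read must reach. It uses the system_auth replication map quoted elsewhere in this thread and the standard QUORUM definition (half the total replication factor across all DCs, rounded down, plus one). The function name `quorum_size` is hypothetical and not part of any Cassandra API.

```python
def quorum_size(replication):
    """Replicas a QUORUM read must reach: floor(total RF / 2) + 1."""
    # Every key except "class" is a datacenter name mapped to its RF.
    total_rf = sum(int(rf) for dc, rf in replication.items() if dc != "class")
    return total_rf // 2 + 1

# Replication settings for system_auth as quoted in this thread.
system_auth = {
    "class": "org.apache.cassandra.locator.NetworkTopologyStrategy",
    "us-east": "6",
    "us-east-backup": "1",
}

print(quorum_size(system_auth))  # total RF is 7, so QUORUM needs 4 replicas
```

With these settings a QUORUM auth read waits on 4 replicas, while roles other than the default "cassandra" user are, as far as I know, read at LOCAL_ONE and need only a single local replica.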

Re: Permission/Role Cache causing timeouts in apps.

2021-07-27 Thread Erick Ramirez
Are you using the default `cassandra` superuser role? Because that would be expensive. Also confirm if you've set the replication for the `system_auth` keyspace to NTS because if you have multiple DCs, the request could be going to another DC. It's interesting that you've set validity to over 3