Hey All,

I'm coming back to my own question (see below), as this happened to us again two days later, so we took the time to analyse the issue further. I'd like to share our findings and the workaround we figured out, too.

To quickly sum up the most important details again:

 * we have a 3-node cluster - Cassandra 4.0-alpha4, RF=2 - in one DC
 * we are using consistency level ONE in all queries
 * if we lose one node from the cluster, then
     o non-counter table writes are fine, the remaining 2 nodes take
       over everything
     o but counter table writes start to fail with the exception
       "com.datastax.driver.core.exceptions.WriteTimeoutException:
       Cassandra timeout during COUNTER write query at consistency ONE
       (1 replica were required but only 0 acknowledged the write)"
       (a sketch of such a write follows below this list)
     o the two remaining nodes are both producing hints files for the
       fallen node
 * just a note: counter_write_request_timeout_in_ms = 10000,
   write_request_timeout_in_ms = 5000 in our cassandra.yaml
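
For context, here is a minimal sketch of how such a counter write is issued from our App with the DataStax 3.x driver - keyspace, table, and column names are made up for illustration; this is the kind of statement that starts throwing the WriteTimeoutException quoted above:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.QueryOptions;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.exceptions.WriteTimeoutException;

    public class CounterWriteSketch {
        public static void main(String[] args) {
            // consistency ONE is set as the default for every query
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")
                    .withQueryOptions(new QueryOptions()
                            .setConsistencyLevel(ConsistencyLevel.ONE))
                    .build();
            try (Session session = cluster.connect("myks")) {
                // increment a counter column ("myks"/"page_counters" are
                // hypothetical names)
                SimpleStatement increment = new SimpleStatement(
                        "UPDATE page_counters SET hits = hits + 1 WHERE page_id = ?",
                        "home");
                try {
                    session.execute(increment);
                } catch (WriteTimeoutException e) {
                    // this is what we see once one node is hard-killed,
                    // even though one of the two replicas is still alive
                    System.err.println("counter write timed out: " + e.getMessage());
                }
            } finally {
                cluster.close();
            }
        }
    }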

To test this a bit further, we did the following:

 * we shut down one of the nodes normally
   In this case we do not get the above behavior - everything happens
   as it should and there are no failures on counter table writes,
   so this is good
 * we reproduced the issue in our TEST env by hard-killing one of the
   nodes instead of shutting it down normally (simulating the hardware
   failure we had in PROD)
   Bingo, the issue starts immediately! (The commands we used are
   sketched below.)
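
For reference, this is roughly how we trigger the two cases - assuming a systemd-managed package install, so adjust the service commands to your setup:

    # graceful shutdown - no counter write failures afterwards
    nodetool drain
    sudo systemctl stop cassandra

    # hard kill, simulating a hardware failure - counter writes start
    # timing out immediately
    sudo pkill -9 -f CassandraDaemon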

Based on the above observations, the "normal shutdown - no problem" case gave us an idea - so now we have a workaround to get the cluster back into a working state in case we lose a node permanently (or at least for a long time):

1. (in our case) we stop the App to stop all Cassandra operations
2. we stop all remaining nodes in the cluster normally
3. we restart them normally

This way the remaining nodes realize the failed node is down and switch over to the expected processing - everything works, including counter table writes. (A command-level sketch of this rolling restart follows below.)
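
In command terms the workaround looks roughly like this - again assuming a systemd-managed install; run the drain/stop steps on every remaining node before starting them again:

    # 1. stop the App so no Cassandra traffic is flowing
    # 2. shut down each remaining node gracefully
    nodetool drain
    sudo systemctl stop cassandra
    # 3. once all nodes are down, start them again
    sudo systemctl start cassandra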

If anyone has any idea what to check / change / do in our cluster, I'm all ears! :-)

thanks

Attila Wind

http://www.linkedin.com/in/attilaw
Mobile: +49 176 43556932


On 22.01.2021 07:35, Attila Wind wrote:

Hey guys,

Yesterday we had an outage after we lost a node, and we saw behavior we cannot explain.

Our data schema has both counter and normal tables. And we have replication factor = 2 and consistency level LOCAL_ONE (explicitly set). A hypothetical schema sketch follows below.
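
For illustration, the schema is along these lines - all names are made up and the real schema is bigger; note the counter column in the first table:

    import com.datastax.driver.core.Session;

    final class SchemaSketch {
        // hypothetical DDL; "myks", "dc1" and the table names are invented
        static void createSchema(Session session) {
            session.execute("CREATE KEYSPACE IF NOT EXISTS myks WITH replication = "
                    + "{'class': 'NetworkTopologyStrategy', 'dc1': 2}");
            // a counter table - the kind whose updates slowed down
            session.execute("CREATE TABLE IF NOT EXISTS myks.page_counters ("
                    + "page_id text PRIMARY KEY, hits counter)");
            // a normal (non-counter) table - these kept behaving fine
            session.execute("CREATE TABLE IF NOT EXISTS myks.page_visits ("
                    + "page_id text, visit_time timestamp, visitor_id text, "
                    + "PRIMARY KEY (page_id, visit_time))");
        }
    }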

What we saw:
After a node went down, the updates of the counter tables slowed down. A lot! These updates normally take only a few milliseconds, but now started to take 30-60 seconds(!) At the same time, the write ops against non-counter tables did not show any difference. The app log was silent in terms of errors, so the queries - including the counter table updates - were not failing at all (otherwise we would see exceptions coming from the DAO layer, originating from the Cassandra driver). One more thing: only those updates where the lost node was involved (due to the partition key) suffered from the above huge wait times. Other updates just went fine.

The whole thing looks like Cassandra internally started to wait - a lot - for the lost node. The updates finally succeeded without failure - at least from the App's (the client's) point of view.

Has anyone ever experienced similar behavior?
What could be an explanation for the above?

Some more details: the App is implemented in Java 8, we are using the DataStax driver 3.7.1, and the server cluster is running Cassandra 4.0-alpha4. Cluster size is 3 nodes.

Any feedback is appreciated! :-)

thanks

--
Attila Wind

http://www.linkedin.com/in/attilaw
Mobile: +49 176 43556932

