Hey All,

I'm coming back to my own question (see below), as this happened to us again two days later, so we took the time to analyse the issue further. I'd like to share our findings and the workaround we figured out, too.

To quickly sum up the most important details again:

 * we have a 3-node cluster - Cassandra 4.0-alpha4, RF=2 - in one DC
 * we are using consistency level ONE in all queries
 * if we lose one node from the cluster, then
     o non-counter table writes are fine, the remaining 2 nodes take
       over everything
     o but counter table writes start to fail with the exception
       "com.datastax.driver.core.exceptions.WriteTimeoutException:
       Cassandra timeout during COUNTER write query at consistency ONE
       (1 replica were required but only 0 acknowledged the write)"
       (a sketch of such a write follows below this list)
     o the two remaining nodes are both producing hints files for the
       fallen node
 * just a note: counter_write_request_timeout_in_ms = 10000,
   write_request_timeout_in_ms = 5000 in our cassandra.yaml
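
For context, here is a minimal sketch of how such a counter write is issued from our App with the DataStax 3.x driver - keyspace, table, and column names are made up for illustration; this is the kind of statement that starts throwing the WriteTimeoutException quoted above:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.QueryOptions;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.exceptions.WriteTimeoutException;

    public class CounterWriteSketch {
        public static void main(String[] args) {
            // consistency ONE is set as the default for every query
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")
                    .withQueryOptions(new QueryOptions()
                            .setConsistencyLevel(ConsistencyLevel.ONE))
                    .build();
            try (Session session = cluster.connect("myks")) {
                // increment a counter column ("myks"/"page_counters" are
                // hypothetical names)
                SimpleStatement increment = new SimpleStatement(
                        "UPDATE page_counters SET hits = hits + 1 WHERE page_id = ?",
                        "home");
                try {
                    session.execute(increment);
                } catch (WriteTimeoutException e) {
                    // this is what we see once one node is hard-killed,
                    // even though one of the two replicas is still alive
                    System.err.println("counter write timed out: " + e.getMessage());
                }
            } finally {
                cluster.close();
            }
        }
    }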

To test this a bit further, we did the following:

 * we shut down one of the nodes normally
   In this case we do not get the above behavior - everything happens
   as it should and there are no failures on counter table writes,
   so this is good
 * we reproduced the issue in our TEST env by hard-killing one of the
   nodes instead of shutting it down normally (simulating the hardware
   failure we had in PROD)
   Bingo, the issue starts immediately! (The commands we used are
   sketched below.)
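
For reference, this is roughly how we trigger the two cases - assuming a systemd-managed package install, so adjust the service commands to your setup:

    # graceful shutdown - no counter write failures afterwards
    nodetool drain
    sudo systemctl stop cassandra

    # hard kill, simulating a hardware failure - counter writes start
    # timing out immediately
    sudo pkill -9 -f CassandraDaemon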

Based on the above observations, the "normal shutdown - no problem" case gave us an idea - so now we have a workaround to get the cluster back into a working state in case we lose a node permanently (or at least for a long time):

1. (in our case) we stop the App to stop all Cassandra operations
2. we stop all remaining nodes in the cluster normally
3. we restart them normally

This way the remaining nodes realize the failed node is down and switch over to the expected processing - everything works, including counter table writes. (A command-level sketch of this rolling restart follows below.)
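
In command terms the workaround looks roughly like this - again assuming a systemd-managed install; run the drain/stop steps on every remaining node before starting them again:

    # 1. stop the App so no Cassandra traffic is flowing
    # 2. shut down each remaining node gracefully
    nodetool drain
    sudo systemctl stop cassandra
    # 3. once all nodes are down, start them again
    sudo systemctl start cassandra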

If anyone has any idea what to check / change / do in our cluster, I'm all ears! :-)

thanks

Attila Wind

http://www.linkedin.com/in/attilaw
Mobile: +49 176 43556932


On 22.01.2021 07:35, Attila Wind wrote:

Hey guys,

Yesterday we had an outage after we lost a node, and we saw behavior we cannot explain.

Our data schema has both counter and normal tables. And we have replication factor = 2 and consistency level LOCAL_ONE (explicitly set). A hypothetical schema sketch follows below.
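
For illustration, the schema is along these lines - all names are made up and the real schema is bigger; note the counter column in the first table:

    import com.datastax.driver.core.Session;

    final class SchemaSketch {
        // hypothetical DDL; "myks", "dc1" and the table names are invented
        static void createSchema(Session session) {
            session.execute("CREATE KEYSPACE IF NOT EXISTS myks WITH replication = "
                    + "{'class': 'NetworkTopologyStrategy', 'dc1': 2}");
            // a counter table - the kind whose updates slowed down
            session.execute("CREATE TABLE IF NOT EXISTS myks.page_counters ("
                    + "page_id text PRIMARY KEY, hits counter)");
            // a normal (non-counter) table - these kept behaving fine
            session.execute("CREATE TABLE IF NOT EXISTS myks.page_visits ("
                    + "page_id text, visit_time timestamp, visitor_id text, "
                    + "PRIMARY KEY (page_id, visit_time))");
        }
    }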

What we saw:
After a node went down, the updates of the counter tables slowed down. A lot! These updates normally take only a few milliseconds, but now started to take 30-60 seconds(!) At the same time, the write ops against non-counter tables did not show any difference. The app log was silent in terms of errors, so the queries - including the counter table updates - were not failing at all (otherwise we would see exceptions coming from the DAO layer, originating from the Cassandra driver). One more thing: only those updates where the lost node was involved (due to the partition key) suffered from the above huge wait times. Other updates just went fine.

The whole thing looks like Cassandra internally started to wait - a lot - for the lost node. The updates finally succeeded without failure - at least from the App's (the client's) point of view.

Has anyone ever experienced similar behavior?
What could be an explanation for the above?

Some more details: the App is implemented in Java 8, we are using the DataStax driver 3.7.1, and the server cluster is running Cassandra 4.0-alpha4. Cluster size is 3 nodes.

Any feedback is appreciated! :-)

thanks

--
Attila Wind

http://www.linkedin.com/in/attilaw
Mobile: +49 176 43556932

