Hey All,
I'm coming back to my own question (see below), as this happened to us
again two days later, so we took the time to analyse the issue further.
I'd like to share our findings and the workaround we figured out.
To quickly sum up the most important details again:
* we have a 3-node cluster - Cassandra 4.0-alpha4 with RF=2 - in one DC
* we use consistency level ONE in all queries
* if we lose one node from the cluster, then
    o non-counter table writes are fine, the remaining 2 nodes take
      over everything
    o but counter table writes start to fail with the exception
      "com.datastax.driver.core.exceptions.WriteTimeoutException:
      Cassandra timeout during COUNTER write query at consistency ONE
      (1 replica were required but only 0 acknowledged the write)"
    o the two remaining nodes are both producing hints files for the
      fallen one
* just a note: counter_write_request_timeout_in_ms = 10000,
write_request_timeout_in_ms = 5000 in our cassandra.yaml
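For reference, this is how those two settings look in our cassandra.yaml (same values as above, everything else at its defaults). As far as I understand, counter writes get a separate, longer timeout because they internally do a read-before-write on the leader replica:

```yaml
# cassandra.yaml (excerpt) - server-side write timeouts
counter_write_request_timeout_in_ms: 10000   # counter writes (read-before-write path)
write_request_timeout_in_ms: 5000            # regular writes
```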
To test this a bit further, we did the following:
* we shut down one of the nodes normally
    In this case we do not see the above behavior - everything happens
    as it should, no failures on counter table writes,
    so this is good
* we reproduced the issue in our TEST env by hard-killing one of the
    nodes instead of a normal shutdown (simulating a hardware failure
    like the one we had in PROD)
    Bingo, the issue starts immediately!
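To be concrete, the only difference between the two test cases was how the node went down - roughly this (the service name and the PID lookup are from our setup, adjust to yours):

```shell
# clean shutdown - no counter write failures afterwards
nodetool drain            # flushes memtables and announces the shutdown via gossip
systemctl stop cassandra

# hard kill - reproduces the issue immediately
kill -9 "$(pgrep -f CassandraDaemon)"   # no gossip shutdown message gets sent
```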
Based on the above observations, the "normal shutdown - no problem" case
gave us an idea - so now we have a workaround to get the cluster back
into a working state in case we lose a node permanently (or at least
for a long time):
1. (in our case) we stop the App to stop all Cassandra operations
2. stop all remaining nodes in the cluster normally
3. restart them normally
This way the remaining nodes realize the failed node is down and jump
into the expected processing - everything works, including counter
table writes.
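Sketched as commands, the workaround is (assuming a systemd-managed service; we do steps 2-3 node by node on each surviving node):

```shell
# 1. stop the App (however your deployment does that) so no traffic hits Cassandra
# 2. on each remaining node, shut Cassandra down cleanly
nodetool drain
systemctl stop cassandra
# 3. start them again and verify the ring state
systemctl start cassandra
nodetool status   # the dead node should show as DN, the survivors as UN
```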
If anyone has any idea what to check / change / do in our cluster, I'm
all ears! :-)
thanks
Attila Wind
http://www.linkedin.com/in/attilaw
Mobile: +49 176 43556932
On 22.01.2021 at 07:35, Attila Wind wrote:
Hey guys,
Yesterday we had an outage after we lost a node, and we saw behavior we
cannot explain.
Our data schema has both counter and normal tables. And we have
replication factor = 2 and consistency level LOCAL_ONE (explicitly set).
What we saw:
After a node went down, the updates of the counter tables slowed down.
A lot! These updates normally take only a few milliseconds, but now
they started to take 30-60 seconds(!)
At the same time, the write ops against non-counter tables did not show
any difference. The app log was silent in terms of errors. So the
queries - including the counter table updates - were not failing at all
(otherwise we would see exceptions coming from the DAO layer,
originating from the Cassandra driver).
One more thing: only those updates where the lost node was involved
(due to the partition key) suffered from the above huge wait time.
Other updates went through just fine.
The whole thing looks as if Cassandra internally started to wait - a
lot - for the lost node. The updates finally succeeded without failure -
at least from the App's (the client's) perspective.
Did anyone ever experience similar behavior?
What could be an explanation for the above?
Some more details: the App is implemented in Java 8, we are using the
DataStax driver 3.7.1, and the server cluster is running Cassandra 4.0
alpha 4. The cluster size is 3 nodes.
Any feedback is appreciated! :-)
thanks
--
Attila Wind
http://www.linkedin.com/in/attilaw
Mobile: +49 176 43556932