To start with, maybe update to beta4. There's an absolutely massive list of fixes since alpha4. I don't think the alphas are necessarily expected to be in a usable/low-bug state, whereas beta4 is approaching RC status.
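
If you go that route, the usual pattern is a rolling upgrade, one node at a
time - roughly the following, assuming a package install and systemd
(adjust the service name and the install step to your setup):

    nodetool drain                   # flush memtables, stop accepting traffic on this node
    sudo systemctl stop cassandra    # clean shutdown
    # ... install the 4.0-beta4 binaries here (package upgrade or tarball swap) ...
    sudo systemctl start cassandra
    nodetool status                  # wait for the node to show UN again before moving on
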
On Tue, Jan 26, 2021, 10:44 PM Attila Wind <attilaw@swf.technology> wrote:

> Hey All,
>
> I'm coming back to my own question (see below) as this happened again two
> days later, so we took the time to analyse the issue further. I'd like to
> share our experiences and the workaround we figured out.
>
> To quickly sum up the most important details again:
>
>    - we have a 3-node cluster - Cassandra 4.0-alpha4 and RF=2 - in one DC
>    - we are using consistency level ONE in all queries
>    - if we lose one node from the cluster, then
>       - non-counter table writes are fine, the remaining 2 nodes take
>         over everything
>       - but counter table writes start to fail with the exception
>         "com.datastax.driver.core.exceptions.WriteTimeoutException:
>         Cassandra timeout during COUNTER write query at consistency ONE
>         (1 replica were required but only 0 acknowledged the write)"
>       - the two remaining nodes are both producing hints files for the
>         fallen one
>       - just a note: counter_write_request_timeout_in_ms = 10000 and
>         write_request_timeout_in_ms = 5000 in our cassandra.yaml
>
> To test this a bit further we did the following:
>
>    - we shut down one of the nodes normally
>      In this case we do not see the behavior above - everything happens
>      as it should, with no failures on counter table writes, so this is
>      good
>    - we reproduced the issue in our TEST env by hard-killing one of the
>      nodes instead of shutting it down normally (simulating a hardware
>      failure like the one we had in PROD)
>      Bingo, the issue starts immediately!
>
> Based on the above observations, the "normal shutdown - no problem" case
> gave us an idea, so we now have a workaround to get the cluster back into
> a working state if we lose a node permanently (or at least for a long
> time):
>
>    1. (in our case) we stop the App to stop all Cassandra operations
>    2. stop all remaining nodes in the cluster normally
>    3. restart them normally
>
> This way the remaining nodes realize the failed node is down and fall
> back to the expected processing - everything works, including counter
> table writes.
>
> If anyone has any idea what to check / change / do in our cluster, I'm
> all ears! :-)
>
> thanks
> Attila Wind
>
> http://www.linkedin.com/in/attilaw
> Mobile: +49 176 43556932
>
>
> On 22.01.2021 07:35, Attila Wind wrote:
>
> Hey guys,
>
> Yesterday we had an outage after losing a node, and we saw behavior we
> cannot explain.
>
> Our data schema has both counter and normal tables. And we have
> replication factor = 2 and consistency level LOCAL_ONE (explicitly set).
>
> What we saw:
> After a node went down, the updates of the counter tables slowed down. A
> lot! These updates normally take only a few milliseconds but started to
> take 30-60 seconds(!)
> At the same time, the write ops against non-counter tables did not show
> any difference. The app log was silent in terms of errors. So the
> queries - including the counter table updates - were not failing at all
> (otherwise we would see exceptions coming from the DAO layer, originating
> from the Cassandra driver).
> One more thing: only those updates where the lost node was involved (due
> to the partition key) suffered from the above huge wait time. Other
> updates went through just fine.
>
> The whole thing looks like Cassandra internally started to wait - a lot -
> for the lost node. Updates finally succeeded without failure - at least
> from the App's (the client's) point of view.
>
> Has anyone ever experienced similar behavior?
> What could be an explanation for the above?
>
> Some more details: the App is implemented in Java 8, we are using the
> Datastax driver 3.7.1, and the server cluster is running Cassandra 4.0
> alpha 4. Cluster size is 3 nodes.
>
> Any feedback is appreciated! :-)
>
> thanks
> --
> Attila Wind
>
> http://www.linkedin.com/in/attilaw
> Mobile: +49 176 43556932
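
Coming back to the "why": as far as I understand it, counter writes take a
different path from regular writes. A counter update is a read-modify-write,
so the coordinator has to get the increment applied on a live replica of
that partition before it can acknowledge, whereas a plain write at ONE is
satisfied by whichever replica answers first (plus a hint for the dead one).
That would at least be consistent with only counters stalling when a replica
dies hard. For anyone who wants to reproduce it, a minimal counter schema
looks something like this (keyspace, DC and table names are made up):

    CREATE KEYSPACE test_ks
      WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 2};

    CREATE TABLE test_ks.page_views (
      page_id text PRIMARY KEY,
      views   counter    -- counter columns only support UPDATE ... SET views = views + N
    );

    UPDATE test_ks.page_views SET views = views + 1 WHERE page_id = 'home';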
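
Also, since the thread mentions both ONE and LOCAL_ONE, it may be worth
double-checking where the level is actually set - driver 3.x takes a
cluster-wide default plus per-statement overrides, and the two are easy to
mix up. A minimal sketch against driver 3.7 (contact point, keyspace and
table names are placeholders, not your real ones):

    import com.datastax.driver.core.*;

    public class CounterWriteExample {
        public static void main(String[] args) {
            // Cluster-wide default consistency level (used unless a statement overrides it)
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")              // placeholder contact point
                    .withQueryOptions(new QueryOptions()
                            .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE))
                    .build();
            Session session = cluster.connect("test_ks");      // placeholder keyspace

            // Per-statement override, e.g. for a counter update
            Statement update = new SimpleStatement(
                    "UPDATE page_views SET views = views + 1 WHERE page_id = ?", "home")
                    .setConsistencyLevel(ConsistencyLevel.ONE);
            session.execute(update);

            cluster.close();
        }
    }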
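
Finally, next time this happens it would be interesting to check whether the
survivors have actually marked the dead node down - if gossip still sees it
as up, coordinators will keep routing counter writes to it until the failure
detector kicks in. The standard checks from any live node:

    nodetool status             # the dead node should show as DN, not UN
    nodetool gossipinfo         # per-endpoint gossip state as seen by this node
    nodetool describecluster    # schema versions and unreachable members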