To start with, maybe update to beta4. There's an absolutely massive list of fixes since alpha4. I don't think the alphas are necessarily expected to be in a usable/low-bug state, whereas beta4 is approaching RC status.
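
If you go that route, the usual pattern is a rolling upgrade, one node at a
time - roughly the following, assuming a package install and systemd
(adjust the service name and the install step to your setup):

    nodetool drain                   # flush memtables, stop accepting traffic on this node
    sudo systemctl stop cassandra    # clean shutdown
    # ... install the 4.0-beta4 binaries here (package upgrade or tarball swap) ...
    sudo systemctl start cassandra
    nodetool status                  # wait for the node to show UN again before moving on
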
On Tue, Jan 26, 2021, 10:44 PM Attila Wind <attilaw@swf.technology> wrote:

> Hey All,
>
> I'm coming back to my own question (see below) as this happened again two
> days later, so we took the time to analyse the issue further. I'd like to
> share our experiences and the workaround we figured out.
>
> To quickly sum up the most important details again:
>
>    - we have a 3-node cluster - Cassandra 4.0-alpha4 and RF=2 - in one DC
>    - we are using consistency level ONE in all queries
>    - if we lose one node from the cluster, then
>       - non-counter table writes are fine, the remaining 2 nodes take
>         over everything
>       - but counter table writes start to fail with the exception
>         "com.datastax.driver.core.exceptions.WriteTimeoutException:
>         Cassandra timeout during COUNTER write query at consistency ONE
>         (1 replica were required but only 0 acknowledged the write)"
>       - the two remaining nodes are both producing hints files for the
>         fallen one
>       - just a note: counter_write_request_timeout_in_ms = 10000 and
>         write_request_timeout_in_ms = 5000 in our cassandra.yaml
>
> To test this a bit further we did the following:
>
>    - we shut down one of the nodes normally
>      In this case we do not see the behavior above - everything happens
>      as it should, with no failures on counter table writes, so this is
>      good
>    - we reproduced the issue in our TEST env by hard-killing one of the
>      nodes instead of shutting it down normally (simulating a hardware
>      failure like the one we had in PROD)
>      Bingo, the issue starts immediately!
>
> Based on the above observations, the "normal shutdown - no problem" case
> gave us an idea, so we now have a workaround to get the cluster back into
> a working state if we lose a node permanently (or at least for a long
> time):
>
>    1. (in our case) we stop the App to stop all Cassandra operations
>    2. stop all remaining nodes in the cluster normally
>    3. restart them normally
>
> This way the remaining nodes realize the failed node is down and fall
> back to the expected processing - everything works, including counter
> table writes.
>
> If anyone has any idea what to check / change / do in our cluster, I'm
> all ears! :-)
>
> thanks
> Attila Wind
>
> http://www.linkedin.com/in/attilaw
> Mobile: +49 176 43556932
>
>
> On 22.01.2021 07:35, Attila Wind wrote:
>
> Hey guys,
>
> Yesterday we had an outage after losing a node, and we saw behavior we
> cannot explain.
>
> Our data schema has both counter and normal tables. And we have
> replication factor = 2 and consistency level LOCAL_ONE (explicitly set).
>
> What we saw:
> After a node went down, the updates of the counter tables slowed down. A
> lot! These updates normally take only a few milliseconds but started to
> take 30-60 seconds(!)
> At the same time, the write ops against non-counter tables did not show
> any difference. The app log was silent in terms of errors. So the
> queries - including the counter table updates - were not failing at all
> (otherwise we would see exceptions coming from the DAO layer, originating
> from the Cassandra driver).
> One more thing: only those updates where the lost node was involved (due
> to the partition key) suffered from the above huge wait time. Other
> updates went through just fine.
>
> The whole thing looks like Cassandra internally started to wait - a lot -
> for the lost node. Updates finally succeeded without failure - at least
> from the App's (the client's) point of view.
>
> Has anyone ever experienced similar behavior?
> What could be an explanation for the above?
>
> Some more details: the App is implemented in Java 8, we are using the
> Datastax driver 3.7.1, and the server cluster is running Cassandra 4.0
> alpha 4. Cluster size is 3 nodes.
>
> Any feedback is appreciated! :-)
>
> thanks
> --
> Attila Wind
>
> http://www.linkedin.com/in/attilaw
> Mobile: +49 176 43556932
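
Coming back to the "why": as far as I understand it, counter writes take a
different path from regular writes. A counter update is a read-modify-write,
so the coordinator has to get the increment applied on a live replica of
that partition before it can acknowledge, whereas a plain write at ONE is
satisfied by whichever replica answers first (plus a hint for the dead one).
That would at least be consistent with only counters stalling when a replica
dies hard. For anyone who wants to reproduce it, a minimal counter schema
looks something like this (keyspace, DC and table names are made up):

    CREATE KEYSPACE test_ks
      WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 2};

    CREATE TABLE test_ks.page_views (
      page_id text PRIMARY KEY,
      views   counter    -- counter columns only support UPDATE ... SET views = views + N
    );

    UPDATE test_ks.page_views SET views = views + 1 WHERE page_id = 'home';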
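
Also, since the thread mentions both ONE and LOCAL_ONE, it may be worth
double-checking where the level is actually set - driver 3.x takes a
cluster-wide default plus per-statement overrides, and the two are easy to
mix up. A minimal sketch against driver 3.7 (contact point, keyspace and
table names are placeholders, not your real ones):

    import com.datastax.driver.core.*;

    public class CounterWriteExample {
        public static void main(String[] args) {
            // Cluster-wide default consistency level (used unless a statement overrides it)
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")              // placeholder contact point
                    .withQueryOptions(new QueryOptions()
                            .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE))
                    .build();
            Session session = cluster.connect("test_ks");      // placeholder keyspace

            // Per-statement override, e.g. for a counter update
            Statement update = new SimpleStatement(
                    "UPDATE page_views SET views = views + 1 WHERE page_id = ?", "home")
                    .setConsistencyLevel(ConsistencyLevel.ONE);
            session.execute(update);

            cluster.close();
        }
    }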
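
Finally, next time this happens it would be interesting to check whether the
survivors have actually marked the dead node down - if gossip still sees it
as up, coordinators will keep routing counter writes to it until the failure
detector kicks in. The standard checks from any live node:

    nodetool status             # the dead node should show as DN, not UN
    nodetool gossipinfo         # per-endpoint gossip state as seen by this node
    nodetool describecluster    # schema versions and unreachable members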