Thanks Elliott, yep! This is exactly what we figured out as a next step too: upgrade our TEST env to beta4 so we can re-evaluate the test we did.
Makes 100% sense

Attila Wind

http://www.linkedin.com/in/attilaw
Mobile: +49 176 43556932


On 27.01.2021 10:18, Elliott Sims wrote:
To start with, maybe update to beta4.  There's an absolutely massive list of fixes since alpha4.  I don't think the alphas are necessarily expected to be in a usable/low-bug state, whereas beta4 is approaching RC status.

On Tue, Jan 26, 2021, 10:44 PM Attila Wind <attilaw@swf.technology> wrote:

    Hey All,

    I'm coming back to my own question (see below), as this happened
    to us again 2 days later, so we took the time to analyse the issue
    further. I'd like to share our experiences and the workaround we
    figured out.

    So, to quickly sum up the most important details again:

      * we have a 3-node cluster - Cassandra 4.0-alpha4 and RF=2 - in
        one DC
      * we are using consistency level ONE in all queries (a minimal
        driver-setup sketch follows this list)
      * if we lose one node from the cluster, then
          o non-counter table writes are fine, the remaining 2 nodes
            take over everything
          o but counter table writes start to fail with the exception
            "com.datastax.driver.core.exceptions.WriteTimeoutException:
            Cassandra timeout during COUNTER write query at
            consistency ONE (1 replica were required but only 0
            acknowledged the write)"
          o the two remaining nodes are both producing hint files for
            the fallen one
      * just a note: counter_write_request_timeout_in_ms = 10000,
        write_request_timeout_in_ms = 5000 in our cassandra.yaml
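
    In case it helps, here is a minimal sketch of how the consistency
    level is set on our side and where the exception surfaces (DataStax
    driver 3.x API; the contact point, keyspace, table and column names
    below are placeholders, not our real ones):

        import com.datastax.driver.core.Cluster;
        import com.datastax.driver.core.ConsistencyLevel;
        import com.datastax.driver.core.QueryOptions;
        import com.datastax.driver.core.Session;
        import com.datastax.driver.core.SimpleStatement;
        import com.datastax.driver.core.Statement;
        import com.datastax.driver.core.exceptions.WriteTimeoutException;

        public class CounterWriteSketch {
            public static void main(String[] args) {
                // Default every query to consistency level ONE
                Cluster cluster = Cluster.builder()
                        .addContactPoint("10.0.0.1") // placeholder
                        .withQueryOptions(new QueryOptions()
                                .setConsistencyLevel(ConsistencyLevel.ONE))
                        .build();
                Session session = cluster.connect("my_keyspace"); // placeholder

                // Counter columns can only be changed via UPDATE ... = ... + n
                Statement update = new SimpleStatement(
                        "UPDATE page_hits SET hits = hits + 1 WHERE page_id = ?",
                        "home");
                try {
                    session.execute(update);
                } catch (WriteTimeoutException e) {
                    // After the hard kill this is where the "1 replica were
                    // required but only 0 acknowledged the write" message lands
                    System.err.println("Counter write timed out: " + e.getMessage());
                } finally {
                    cluster.close();
                }
            }
        }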

    To test this a bit further, we did the following:

      * we shut down one of the nodes normally
        In this case we do not see the above behavior - everything
        happens as it should, with no failures on counter table writes -
        so this is good
      * we reproduced the issue in our TEST env by hard-killing one of
        the nodes instead of shutting it down normally (simulating a
        hardware failure like the one we had in PROD)
        Bingo, the issue starts immediately!

    Based on the above observations, the "normal shutdown - no problem"
    case gave us an idea, so now we have a workaround for getting the
    cluster back into a working state if we lose a node permanently
    (or at least for a long time):

     1. (in our case) we stop the App to stop all Cassandra operations
     2. stop all remaining nodes in the cluster normally
     3. restart them normally

    This way the remaining nodes realize the failed node is down and
    jump into the expected processing - everything works, including
    counter table writes.

    If anyone has any idea what to check / change / do in our cluster,
    I'm all ears! :-)

    thanks

    Attila Wind

    http://www.linkedin.com/in/attilaw
    Mobile: +49 176 43556932


    On 22.01.2021 07:35, Attila Wind wrote:

    Hey guys,

    Yesterday we had an outage after we lost a node, and we saw
    behavior we cannot explain.

    Our data schema has both counter and normal tables, and we have
    replication factor = 2 and consistency level LOCAL_ONE (explicitly
    set).

    What we saw:
    After a node went down, the updates of the counter tables slowed
    down. A lot! These updates normally take only a few milliseconds,
    but now they started to take 30-60 seconds(!)
    At the same time, the write ops against non-counter tables did not
    show any difference. The app log was silent in terms of errors, so
    the queries - including the counter table updates - were not
    failing at all (otherwise we would see exceptions coming from the
    DAO layer, originating from the Cassandra driver).
    One more thing: only those updates suffered from the above huge
    wait time where the lost node was involved (due to the partition
    key); other updates went through just fine.
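
    For reference, the driver's token metadata can tell you which hosts
    own a given partition key - a minimal sketch of that check (the
    contact point, keyspace and key value are placeholders):

        import java.nio.ByteBuffer;
        import java.util.Set;

        import com.datastax.driver.core.Cluster;
        import com.datastax.driver.core.Host;
        import com.datastax.driver.core.Metadata;
        import com.datastax.driver.core.ProtocolVersion;
        import com.datastax.driver.core.TypeCodec;

        public class ReplicaCheck {
            public static void main(String[] args) {
                Cluster cluster = Cluster.builder()
                        .addContactPoint("10.0.0.1") // placeholder
                        .build();
                try {
                    Metadata metadata = cluster.getMetadata();
                    // Serialize a text partition key the way the driver would
                    ByteBuffer key = TypeCodec.varchar()
                            .serialize("some-key", ProtocolVersion.V4);
                    // With RF=2 this should list exactly two hosts
                    Set<Host> replicas = metadata.getReplicas("my_keyspace", key);
                    for (Host host : replicas) {
                        System.out.println(host.getAddress() + " up=" + host.isUp());
                    }
                } finally {
                    cluster.close();
                }
            }
        }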

    The whole thing looks as if Cassandra internally started to wait -
    a lot - for the lost node. The updates finally succeeded without
    failure - at least from the App's (the client's) point of view.
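
    One client-side knob that may play a role in how long the App waits
    is the driver's read timeout: in driver 3.x the client waits up to
    SocketOptions.readTimeoutMillis (12000 ms by default) per host
    before giving up on it and possibly retrying elsewhere, so a few
    retries might add up to delays in the range we observed. A minimal
    sketch of where that setting lives (the value shown is purely
    illustrative, not our configuration):

        import com.datastax.driver.core.Cluster;
        import com.datastax.driver.core.SocketOptions;

        public class ReadTimeoutSketch {
            public static void main(String[] args) {
                // Per-request client read timeout; it should stay above the
                // server-side *_request_timeout_in_ms values so the server,
                // not the client, times out first
                Cluster cluster = Cluster.builder()
                        .addContactPoint("10.0.0.1") // placeholder
                        .withSocketOptions(new SocketOptions()
                                .setReadTimeoutMillis(15000)) // illustrative only
                        .build();
                cluster.close();
            }
        }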

    Has anyone ever experienced similar behavior?
    What could be an explanation for the above?

    Some more details: the App is implemented in Java 8, we are using
    the DataStax driver 3.7.1, and the server cluster is running
    Cassandra 4.0-alpha4. Cluster size is 3 nodes.

    Any feedback is appreciated! :-)

    thanks

    --
    Attila Wind

    http://www.linkedin.com/in/attilaw
    Mobile: +49 176 43556932

