Because an operator will need to check that the schema is consistent across the cluster before running "nodetool cleanup". At the moment, it's the operator's responsibility to ensure bad things don't happen.
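
For example (just a sketch of the kind of pre-check meant here), "nodetool describecluster" reports the schema version each node is on, and every live node should agree before cleanup is kicked off:

    nodetool describecluster
    # all live nodes should be listed under a single entry in the
    # "Schema versions" section before running "nodetool cleanup"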

On 09/05/2023 06:20, Jaydeep Chovatia wrote:
One clarification question Jeff.
AFAIK, /nodetool cleanup/ internally goes through the same compaction path as regular compaction. Then why do we have to wait for CEP-21 to clean up unowned data in the regular compaction path? Wouldn't it be as simple as regular compaction invoking the same code as /nodetool cleanup/? In other words, without CEP-21, why is /nodetool cleanup/ a safe operation while doing the same thing in regular compaction isn't?

Jaydeep

On Fri, May 5, 2023 at 11:58 AM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:

    Thanks, Jeff, for the detailed steps and summary.
    We will keep the community (this thread) up to date on how it
    plays out in our fleet.

    Jaydeep

    On Fri, May 5, 2023 at 9:10 AM Jeff Jirsa <jji...@gmail.com> wrote:

        Lots of caveats on these suggestions, let me try to hit most
        of them.

        Cleanup in parallel is good and fine and common. Limit the
        number of threads cleanup uses if you're running lots of
        vnodes, so each node runs one cleanup compaction at a time
        and you don't have every node using all of its cores at
        the same time.
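
        For example (the keyspace name is just a placeholder), the
        -j/--jobs flag limits how many sstables are cleaned up
        concurrently on a node:

            nodetool cleanup -j 1 my_keyspace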
        If a host is fully offline, you can ALSO use
        replace_address_first_boot. It'll stream data right to the
        replacement host with the same token assignments the old one
        had, so no cleanup is needed in that case. Strictly speaking,
        to avoid resurrection here you'd want to run repair on the
        replicas of the down host (for vnodes, probably the whole
        cluster), but your current process doesn't guarantee that
        either (decom + bootstrap may resurrect, strictly speaking).
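
        Roughly (the IP below is a placeholder), that means setting
        the flag on the replacement host before its very first start,
        then repairing the affected replicas afterwards:

            # cassandra-env.sh (or jvm options) on the replacement
            # host, before first start; 10.0.0.5 is the dead node
            JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.5"

            # afterwards, to close the resurrection window, repair
            # the replicas of the down host (with vnodes, likely
            # the whole cluster)
            nodetool repair -pr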
        Dropping the number of vnodes will reduce the number of
        replicas that have to be cleaned up, but may also increase
        the imbalance after each replacement.
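
        If you do go that route, the relevant knobs live in
        cassandra.yaml on newly bootstrapped nodes (the values below
        are only an illustration; the allocation option is the 4.x
        name):

            num_tokens: 16
            allocate_tokens_for_local_replication_factor: 3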

        Cassandra should still do this on its own, and I think once
        CEP-21 is committed, this should be one of the first
        enhancement tickets.

        Until then, LeveledCompactionStrategy really does make cleanup
        fast and cheap, at the cost of higher IO the rest of the time.
        If you can tolerate that higher IO, you'll probably appreciate
        LCS anyway (faster reads, faster data deletion than STCS).
        It's a lot of IO compared to STCS though.
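
        Switching a table over is a one-liner in cqlsh (keyspace,
        table, and sstable size below are only placeholders):

            ALTER TABLE my_ks.my_table
            WITH compaction = {'class': 'LeveledCompactionStrategy',
                               'sstable_size_in_mb': 160};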


        On Fri, May 5, 2023 at 9:02 AM Jaydeep Chovatia
        <chovatia.jayd...@gmail.com> wrote:

            Thanks all for your valuable inputs. We will try some of
            the suggested methods in this thread, and see how it goes.
            We will keep you updated on our progress.
            Thanks a lot once again!

            Jaydeep

            On Fri, May 5, 2023 at 8:55 AM Bowen Song via user
            <user@cassandra.apache.org> wrote:

                Depending on the number of vnodes per server, the
                probability and severity (i.e. the size of the
                affected token ranges) of an availability degradation
                due to a server failure during node replacement may be
                small. You also have the choice of increasing the RF
                if that's still not acceptable.

                Also, reducing the number of vnodes per server can
                limit the number of servers affected by replacing a
                single server, thereby reducing the amount of time
                required to run "nodetool cleanup" if it is run
                sequentially.

                Finally, you may choose to run "nodetool cleanup"
                concurrently on multiple nodes to reduce the amount of
                time required to complete it.
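
                Something as simple as a loop over a host list would
                do (hostnames and passwordless ssh below are
                assumptions of this sketch):

                    for h in cass-node-01 cass-node-02 cass-node-03; do
                        ssh "$h" 'nodetool cleanup -j 1' &
                    done
                    wait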


                On 05/05/2023 16:26, Runtian Liu wrote:
                We are doing the "add a node, then decommission a
                node" approach to achieve better availability.
                Replacing a node requires shutting that node down
                first; if another node goes down during the
                replacement period, we will see an availability
                drop, because most of our use cases are LOCAL_QUORUM
                with replication factor 3.

                On Fri, May 5, 2023 at 5:59 AM Bowen Song via user
                <user@cassandra.apache.org> wrote:

                    Have you thought of using
                    "-Dcassandra.replace_address_first_boot=..." (or
                    "-Dcassandra.replace_address=..." if you are
                    using an older version)? This will not result in
                    a topology change, which means "nodetool cleanup"
                    is not needed after the operation is completed.

                    On 05/05/2023 05:24, Jaydeep Chovatia wrote:
                    Thanks, Jeff!
                    But in our environment we replace nodes quite
                    often for various optimization purposes, say
                    almost one node per day (a node /addition/
                    followed by a node /decommission/, which of
                    course changes the topology), and we have a
                    cluster of 100 nodes with 300GB per node. If we
                    have to run cleanup on all 100 nodes after every
                    replacement, it could take forever.
                    What is the recommendation until this gets fixed
                    in Cassandra itself as part of compaction
                    (without externally triggering /cleanup/)?

                    Jaydeep

                    On Thu, May 4, 2023 at 8:14 PM Jeff Jirsa
                    <jji...@gmail.com> wrote:

                        Cleanup is fast and cheap and basically a
                        no-op if you haven’t changed the ring

                        After Cassandra has transactional cluster
                        metadata to make ring changes strongly
                        consistent, Cassandra should do this in
                        every compaction. But until then it's left
                        for operators to run when they're sure the
                        state of the ring is correct.
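
                        As a rough sketch of that sanity check (not
                        exhaustive), something like:

                            # every node Up/Normal, no streams or
                            # range movements still in flight
                            nodetool status
                            nodetool netstats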



                        On May 4, 2023, at 7:41 PM, Jaydeep
                        Chovatia <chovatia.jayd...@gmail.com> wrote:

                        
                        Isn't this considered a kind of *bug* in
                        Cassandra? As we know, /cleanup/ is a
                        lengthy and unreliable operation, so relying
                        on /cleanup/ means higher chances of data
                        resurrection.
                        Do you think we should discard the unowned
                        token ranges as part of regular compaction
                        itself? What are the pitfalls of doing this
                        as part of compaction?

                        Jaydeep

                        On Thu, May 4, 2023 at 7:25 PM guo Maxwell
                        <cclive1...@gmail.com> wrote:

                            Compaction will just merge duplicate
                            data and remove deleted data on this
                            node. If you add or remove a node from
                            the cluster, I think cleanup is needed.
                            If cleanup failed, I think we should
                            look into the reason why.

                            On Fri, May 5, 2023 at 06:37 Runtian Liu
                            <curly...@gmail.com> wrote:

                                Hi all,

                                Is cleanup the sole method to
                                remove data that does not belong to
                                a specific node? In a cluster,
                                where nodes are added or
                                decommissioned from time to time,
                                failure to run cleanup may lead to
                                data resurrection issues, as
                                deleted data may remain on the node
                                that lost ownership of certain
                                partitions. Or is it true that
                                normal compactions can also handle
                                data removal for nodes that no
                                longer have ownership of certain data?

                                Thanks,
                                Runtian


