I'm using 4.1.0-1.
I've been doing a lot of truncates lately, before the drive failed (research project).  Current drives have about 100 GBytes of data each, although the actual amount of data in Cassandra is much less (because of truncates and snapshots).  The cluster is not homogeneous; some nodes have more drives than others.

nodetool status -r
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                     Load       Tokens  Owns  Host ID                               Rack
UN  nyx.querymasters.com        7.9 GiB    250     ?     07bccfce-45f1-41a3-a5c4-ee748a7a9b98  rack1
UN  enceladus.querymasters.com  6.34 GiB   200     ?     274a6e8d-de37-4e0b-b000-02d221d858a5  rack1
UN  aion.querymasters.com       6.31 GiB   200     ?     59150c47-274a-46fb-9d5e-bed468d36797  rack1
UN  calypso.querymasters.com    6.26 GiB   200     ?     e83aa851-69b4-478f-88f6-60e657ea6539  rack1
UN  fortuna.querymasters.com    7.1 GiB    200     ?     49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
UN  kratos.querymasters.com     6.36 GiB   200     ?     0d9509cc-2f23-4117-a883-469a1be54baf  rack1
UN  charon.querymasters.com     6.35 GiB   200     ?     d9702f96-256e-45ae-8e12-69a42712be50  rack1
UN  eros.querymasters.com       6.4 GiB    200     ?     93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
UN  ursula.querymasters.com     6.24 GiB   200     ?     4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
UN  gaia.querymasters.com       6.28 GiB   200     ?     b2e5366e-8386-40ec-a641-27944a5a7cfa  rack1
UN  chaos.querymasters.com      3.78 GiB   120     ?     08a19658-40be-4e55-8709-812b3d4ac750  rack1
UN  pallas.querymasters.com     6.24 GiB   200     ?     b74b6e65-af63-486a-b07f-9e304ec30a39  rack1
UN  paradigm7.querymasters.com  16.25 GiB  500     ?     1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297  rack1
UN  aether.querymasters.com     6.36 GiB   200     ?     352fd049-32f8-4be8-9275-68b145ac2832  rack1
UN  athena.querymasters.com     15.85 GiB  500     ?     b088a8e6-42f3-4331-a583-47ef5149598f  rack1

-Joe

On 1/16/2023 12:23 PM, Jeff Jirsa wrote:
Prior to cassandra-6696 you’d have to treat one missing disk as a failed
machine, wipe all the data and re-stream it, as a tombstone for a given value
may be on one disk and the data on another (effectively resurrecting deleted data)

So the answer has to be version dependent, too - which version were you using?

On Jan 16, 2023, at 9:08 AM, Tolbert, Andy <x...@andrewtolbert.com> wrote:

Hi Joe,

Reading it back I realized I misunderstood that part of your email, so
you must be using data_file_directories with 16 drives?  That's a lot
of drives!  I imagine this may happen from time to time given that
disks like to fail.

That's a bit of an interesting scenario that I would have to think
about.  If you brought the node up without the bad drive, repairs are
probably going to do a ton of overstreaming if you aren't using
4.0 (https://issues.apache.org/jira/browse/CASSANDRA-3200), which may
put things into a really bad state (lots of streaming = lots of
compactions = slower reads), and you may see some inconsistency
if repairs weren't running regularly beforehand.
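
For reference, the kind of repair pass I'd run beforehand looks roughly
like this (the keyspace name is made up; -pr limits each node to its
primary ranges so ranges aren't repaired once per replica):

    # run on each replica of the down node's ranges
    nodetool repair -pr my_keyspace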

How much data was on the drive that failed?  How much data do you
usually have per node?

Thanks,
Andy

On Mon, Jan 16, 2023 at 10:59 AM Joe Obernberger
<joseph.obernber...@gmail.com> wrote:

Thank you Andy.
Is there a way to just remove the drive from the cluster and replace it
later?  Ordering replacement drives isn't a fast process...
What I've done so far is:
Stop node
Remove drive reference from /etc/cassandra/conf/cassandra.yaml
Restart node
Run repair
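
In other words, the yaml edit was along these lines (paths here are just
examples, not my real layout):

    # /etc/cassandra/conf/cassandra.yaml
    data_file_directories:
        - /data/disk01/cassandra
        - /data/disk02/cassandra    # <- failed drive, this line removed
        - /data/disk03/cassandra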

Will that work?  Right now, it's showing all nodes as up.

-Joe

On 1/16/2023 11:55 AM, Tolbert, Andy wrote:
Hi Joe,

I'd recommend just doing a replacement, bringing up a new node with
-Dcassandra.replace_address_first_boot=ip.you.are.replacing as
described here:
https://cassandra.apache.org/doc/4.1/cassandra/operating/topo_changes.html#replacing-a-dead-node
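
On a package install that usually amounts to something like this on the
brand-new node (the IP is an example; set it before the very first start):

    # /etc/cassandra/conf/cassandra-env.sh on the replacement node
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.1.2.3"
    # then start the service; it should bootstrap the dead node's ranges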

Before you do that, you will want to make sure a cycle of repairs has
run on the replicas of the down node to ensure they are consistent
with each other.

Make sure you also have 'auto_bootstrap: true' in the yaml of the node
you are replacing and that the initial_token matches the node you are
replacing (If you are not using vnodes) so the node doesn't skip
bootstrapping.  This is the default, but felt worth mentioning.
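
Roughly this in the replacement node's cassandra.yaml (the token value is
only an illustration, and initial_token only applies without vnodes):

    auto_bootstrap: true
    # must match the token(s) of the node being replaced
    initial_token: -9223372036854775808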

You can also remove the dead node, which should stream data to
replicas that will pick up new ranges, but you also will want to do
repairs ahead of time too.  To be honest it's not something I've done
recently, so I'm not as confident on executing that procedure.
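
If you go that route, it's driven by the dead node's host ID from
'nodetool status', something like (host ID below is a placeholder):

    # run from any live node
    nodetool removenode <host-id-of-dead-node>
    # check on streaming progress
    nodetool removenode status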

Thanks,
Andy


On Mon, Jan 16, 2023 at 9:28 AM Joe Obernberger
<joseph.obernber...@gmail.com> wrote:
Hi all - what is the correct procedure when handling a failed disk?
I have a node in a 15-node cluster.  This node has 16 drives, and Cassandra
data is split across them.  One drive is failing.  Can I just remove it
from the list and Cassandra will then re-replicate?  If not - what?
Thank you!

-Joe


