Re: Failed disks - correct procedure

2023-01-17 Thread C. Scott Andreas
Bumping this note from Andy downthread to make sure everyone has seen it and
is aware: "Before you do that, you will want to make sure a cycle of repairs
has run on the replicas of the down node to ensure they are consistent with
each other."

When replacing an instance, it's necessary to run repair (incremental or
full) among the surviving replicas *before* bootstrapping a replacement
instance in. If you don't do this, Cassandra's quorum consistency guarantees
won't be met and data may appear to be lost. It's not possible to use
Cassandra as a consistent database without doing so.

Given replicas A, B, C, and replacement replica A*:
- Quorum write is witnessed by A, B
- A fails
- A* is bootstrapped in without repair of B, C
- Quorum read succeeds against A*, C
- The successful quorum read will not observe data from the previous
  successful quorum write, and the data will appear to be lost.

Repairing surviving replicas before bootstrapping a replacement node is
necessary to avoid this.

— Scott
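A minimal sketch of the ordering Scott describes, assuming a keyspace named
my_ks (hypothetical) and that B and C are the surviving replicas; the repairs
run first, then the replacement is bootstrapped:

    # On each surviving replica of the down node (B and C), repair the
    # affected keyspace so the survivors are consistent with each other:
    nodetool repair -full my_ks

    # Only after those repairs complete, start the replacement node A*
    # with the replace flag set as a JVM option, e.g.:
    #   -Dcassandra.replace_address_first_boot=<ip.of.failed.node>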
Re: Failed disks - correct procedure

2023-01-17 Thread Joe Obernberger
I come from the Hadoop world, where we have a cluster with probably over
500 drives.  Drives fail all the time; several a year, anyway.
We remove that single drive from HDFS, HDFS re-balances, and when we get
around to it, we swap in a new drive, format it, and add it back to
HDFS.  We keep the OS drives separate from the data drives and ensure
that the OS volume is in a RAID mirror.  It's painful when OS drives
fail, so mirroring works.  When space is low, we add another node with lots
of disks.
We are repurposing this same hardware to run a large Cassandra cluster.
I'd love it if Cassandra could support larger individual nodes, but
we've been trying to configure it with lots of disks for redundancy,
with the idea that we won't use an entire node's storage only for
Cassandra.  As was mentioned a long while back, blades seem to make more
sense for Cassandra than single nodes with lots of disk, but we've got
what we've got!

:)

So far, no issues with:
Stop node, remove drive from cassandra config, start node, run repair - 
version 4.1.
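A rough sketch of that sequence on a single node, assuming systemd packaging
and that the failed drive is one entry in data_file_directories (the paths
here are hypothetical):

    sudo systemctl stop cassandra

    # Edit /etc/cassandra/conf/cassandra.yaml and delete the failed
    # drive's entry from data_file_directories, e.g.:
    #   data_file_directories:
    #     - /data/disk01/cassandra
    #     - /data/disk07/cassandra    <- remove the failed drive's line

    sudo systemctl start cassandra

    # Repair so the data that lived on the removed drive is re-synced
    # from the other replicas:
    nodetool repair -full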


-Joe

On 1/17/2023 10:11 AM, Durity, Sean R via user wrote:


For physical hardware when disks fail, I do a removenode, wait for the 
drive to be replaced, reinstall Cassandra, and then bootstrap the node 
back in (and run clean-up across the DC).


All of our disks are presented as one file system for data, which is 
not what the original question was asking.


Sean R. Durity

*From:* Marc Hoppins
*Sent:* Tuesday, January 17, 2023 3:57 AM
*To:* user@cassandra.apache.org
*Subject:* [EXTERNAL] RE: Failed disks - correct procedure


Hi all,
I was pondering this very situation.
We have a node with a crapped-out disk (not the first time).
Removenode vs repairnode: in regard to time, there is going to be little
difference twixt replacing a dead node and removing then re-installing
a node.  There is going to be a bunch of reads/writes and
verifications (or similar) which is going to take a similar amount of
time... or do I read that wrong?
For myself, I just go with removenode and then rejoin after the HDD has
been replaced.  Usually the fix exceeds the wait time and the node is
then out of the system anyway.

-----Original Message-----
From: Joe Obernberger 
Sent: Monday, January 16, 2023 6:31 PM
To: Jeff Jirsa ; user@cassandra.apache.org
Subject: Re: Failed disks - correct procedure
I'm using 4.1.0-1.
I've been doing a lot of truncates lately before the drive failed
(research project).  Current drives have about 100 GBytes of data each,
although the actual amount of data in Cassandra is much less (because
of truncates and snapshots).  The cluster is not homogeneous; some
nodes have more drives than others.

nodetool status -r
Datacenter: datacenter1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                     Load       Tokens  Owns  Host ID                               Rack
UN  nyx.querymasters.com        7.9 GiB    250     ?     07bccfce-45f1-41a3-a5c4-ee748a7a9b98  rack1
UN  enceladus.querymasters.com  6.34 GiB   200     ?     274a6e8d-de37-4e0b-b000-02d221d858a5  rack1
UN  aion.querymasters.com       6.31 GiB   200     ?     59150c47-274a-46fb-9d5e-bed468d36797  rack1
UN  calypso.querymasters.com    6.26 GiB   200     ?     e83aa851-69b4-478f-88f6-60e657ea6539  rack1
UN  fortuna.querymasters.com    7.1 GiB    200     ?     49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
UN  kratos.querymasters.com     6.36 GiB   200     ?     0d9509cc-2f23-4117-a883-469a1be54baf  rack1
UN  charon.querymasters.com     6.35 GiB   200     ?     d9702f96-256e-45ae-8e12-69a42712be50  rack1
UN  eros.querymasters.com       6.4 GiB    200     ?     93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
UN  ursula.querymasters.com     6.24 GiB   200     ?     4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
UN  gaia.querymasters.com       6.28 GiB   200     ?     b2e5366e-8386-40ec-a641-27944a5a7cfa  rack1
UN  chaos.querymasters.com      3.78 GiB   120     ?     08a19658-40be-4e55-8709-812b3d4ac750  rack1
UN  pallas.querymasters.com     6.24 GiB   200     ?     b74b6e65-af63-486a-b07f-9e304ec30a39  rack1
UN  paradigm7.querymasters.com  16.25 GiB  500     ?     1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297  rack1
UN  aether.querymasters.com     6.36 GiB   200     ?     352fd049-32f8-4be8-9275-68b145ac2832  rack1
UN  athena.querymasters.com     15.85 GiB  500     ?     b088a8e6-42f3-4331-a583-47ef5149598f  rack1
-Joe
On 1/16/2023 12:23 PM, Jeff Jirsa wrote:
> Prior to cassandra-6696 you’d have to treat one missing disk as a 
> failed machine, wipe all the data and re-stream it, as a tombstone for 
> a given value may be on one disk and data on another (effectively 
> redirecting data)

>
> So the answer has to be version dependent, too - which version were you using?

RE: Failed disks - correct procedure

2023-01-17 Thread Durity, Sean R via user
For physical hardware when disks fail, I do a removenode, wait for the drive to 
be replaced, reinstall Cassandra, and then bootstrap the node back in (and run 
clean-up across the DC).
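A condensed sketch of that cycle; the host ID below is a placeholder and
would come from nodetool status for the dead node:

    # From any live node: remove the dead node from the ring
    nodetool removenode <host-id-of-dead-node>

    # ...replace the drive, reinstall Cassandra on the host, and start
    # it so it bootstraps back in as a (new) node...

    # Then, on each remaining node in the datacenter, drop the data it
    # no longer owns:
    nodetool cleanup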

All of our disks are presented as one file system for data, which is not what 
the original question was asking.

Sean R. Durity
RE: Failed disks - correct procedure

2023-01-17 Thread Marc Hoppins
Hi all,

I was pondering this very situation.

We have a node with a crapped-out disk (not the first time). Removenode vs
repairnode: in regard to time, there is going to be little difference twixt
replacing a dead node and removing then re-installing a node.  There is going
to be a bunch of reads/writes and verifications (or similar) which is going to
take a similar amount of time... or do I read that wrong?

For myself, I just go with removenode and then rejoin after the HDD has been
replaced.  Usually the fix exceeds the wait time and the node is then out of
the system anyway.

-----Original Message-----
From: Joe Obernberger  
Sent: Monday, January 16, 2023 6:31 PM
To: Jeff Jirsa ; user@cassandra.apache.org
Subject: Re: Failed disks - correct procedure



I'm using 4.1.0-1.
I've been doing a lot of truncates lately before the drive failed (research
project).  Current drives have about 100 GBytes of data each, although the
actual amount of data in Cassandra is much less (because of truncates and
snapshots).  The cluster is not homogeneous; some nodes have more drives than
others.

nodetool status -r
Datacenter: datacenter1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                     Load       Tokens  Owns  Host ID                               Rack
UN  nyx.querymasters.com        7.9 GiB    250     ?     07bccfce-45f1-41a3-a5c4-ee748a7a9b98  rack1
UN  enceladus.querymasters.com  6.34 GiB   200     ?     274a6e8d-de37-4e0b-b000-02d221d858a5  rack1
UN  aion.querymasters.com       6.31 GiB   200     ?     59150c47-274a-46fb-9d5e-bed468d36797  rack1
UN  calypso.querymasters.com    6.26 GiB   200     ?     e83aa851-69b4-478f-88f6-60e657ea6539  rack1
UN  fortuna.querymasters.com    7.1 GiB    200     ?     49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
UN  kratos.querymasters.com     6.36 GiB   200     ?     0d9509cc-2f23-4117-a883-469a1be54baf  rack1
UN  charon.querymasters.com     6.35 GiB   200     ?     d9702f96-256e-45ae-8e12-69a42712be50  rack1
UN  eros.querymasters.com       6.4 GiB    200     ?     93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
UN  ursula.querymasters.com     6.24 GiB   200     ?     4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
UN  gaia.querymasters.com       6.28 GiB   200     ?     b2e5366e-8386-40ec-a641-27944a5a7cfa  rack1
UN  chaos.querymasters.com      3.78 GiB   120     ?     08a19658-40be-4e55-8709-812b3d4ac750  rack1
UN  pallas.querymasters.com     6.24 GiB   200     ?     b74b6e65-af63-486a-b07f-9e304ec30a39  rack1
UN  paradigm7.querymasters.com  16.25 GiB  500     ?     1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297  rack1
UN  aether.querymasters.com     6.36 GiB   200     ?     352fd049-32f8-4be8-9275-68b145ac2832  rack1
UN  athena.querymasters.com     15.85 GiB  500     ?     b088a8e6-42f3-4331-a583-47ef5149598f  rack1

-Joe

On 1/16/2023 12:23 PM, Jeff Jirsa wrote:
> Prior to cassandra-6696 you’d have to treat one missing disk as a 
> failed machine, wipe all the data and re-stream it, as a tombstone for 
> a given value may be on one disk and data on another (effectively 
> redirecting data)
>
> So the answer has to be version dependent, too - which version were you using?
>
>> On Jan 16, 2023, at 9:08 AM, Tolbert, Andy  wrote:
>>
>> Hi Joe,
>>
>> Reading it back I realized I misunderstood that part of your email, 
>> so you must be using data_file_directories with 16 drives?  That's a 
>> lot of drives!  I imagine this may happen from time to time given 
>> that disks like to fail.
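[For context: a node with that many data drives carries one entry per mount
in cassandra.yaml. A quick way to check what a node is configured with, with
hypothetical paths, might be:

    # Show the data_file_directories block on this node
    grep -A 20 'data_file_directories' /etc/cassandra/conf/cassandra.yaml
    # Expected shape:
    #   data_file_directories:
    #     - /data/disk01/cassandra
    #     - /data/disk02/cassandra
    #     ...
    #     - /data/disk16/cassandra
]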
>>
>> That's a bit of an interesting scenario that I would have to think 
>> about.  If you brought the node up without the bad drive, repairs are 
>> probably going to do a ton of repair overstreaming if you aren't 
>> using
>> 4.0 (https://issues.apache.org/jira/browse/CASSANDRA-3200) which may 
>> put things into a really bad state (lots of streaming = lots of 
>> compactions = slower reads) and you may be seeing some inconsistency 
>> if repairs weren't regularly running beforehand.
>>
>> How much data was on the drive that failed?  How much data do you 
>> usually have per node?
>>
>> Thanks,
>> Andy
>>
>>> On Mon, Jan 16, 2023 at 10:59 AM Joe Obernberger 
>>>  wrote:
>>>
>>> Thank you Andy.
>>> Is there a way to just remove the drive from the cluster and replace 
>>> it later?  Ordering replacement drives isn't a fast process...
>>> What I've done so far is:
>>> Stop node
>>> Remove drive reference from /etc/cassandra/conf/cassandra.yaml
>>> Restart node
>>> Run repair
>>>
>>> Will that work?  Right now, it's showing all nodes as up.
>>>
>>> -Joe
>>>
 On 1/16/2023 11:55 AM, Tolbert, Andy wrote:
 Hi Joe,

 I'd recommend just doing a replacement, bringing up a new node with 
 -Dcassandra.replace_address_first_boot=ip.you.are.replacing as 
 described here:
 https://cassandra.apache.org/doc/4.1/cassandra/operating/topo_changes.html#replacing-a-dead-node
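[A hedged sketch of what that replacement start can look like on a 4.x
package install; the exact file for JVM flags varies by packaging, and
jvm-server.options is an assumption here:

    # On the brand-new node, before its very first start, add:
    #   -Dcassandra.replace_address_first_boot=<ip.of.dead.node>
    # to /etc/cassandra/conf/jvm-server.options (or cassandra-env.sh)

    sudo systemctl start cassandra

    # Watch the replacement stream the dead node's ranges:
    nodetool netstats
]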

 Before you do that, you will want to make sure a cycle of repairs
 has run on the replicas of the down node to ensure they are consistent
 with each other.