New episode of The Apache Cassandra (R) Corner podcast!

2023-03-08 Thread Aaron Ploetz
Link to the next episode: https://drive.google.com/file/d/1_EOBpG3yiuptDJ-PU-3a7amSVvi7pgM8/view?usp=sharing s2Ep2 - Aaron Morton (You may have to download it to listen.) It will remain in staging for 72 hours, going live (assuming no objections) by Saturday, March 11th (22:00 UTC). If anyone

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-08 Thread Jeff Jirsa
On Wed, Mar 8, 2023 at 5:25 AM Bowen Song via dev wrote: > At the moment, when a read error, such as unrecoverable bit error or data corruption, occurs in the SSTable data files, regardless of the disk_failure_policy configuration, manual (or to be precise, external) intervention is

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-08 Thread Bowen Song via dev
"– A repair of the affected range would need to be completed among the replicas without such corruption (including paxos repair)." It can be safe without a repair by over-streaming the data from more (or all) available replicas, either within the DC (when LOCAL_* CL is used) or across

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-08 Thread C. Scott Andreas
Realized I’m somewhat mistaken here - The repair of surviving replicas would be necessary for correctness before the node with deleted data files is able to serve client/internode reads. But the repair of the node with deleted data files prior to being brought back into the cluster is

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-08 Thread C. Scott Andreas
For this to be safe, my understanding is that: – A repair of the affected range would need to be completed among the replicas without such corruption (including paxos repair). – And we'd need a mechanism to execute repair on the affected node without it being available to respond to queries,

[DISCUSS] Enhanced Disk Error Handling

2023-03-08 Thread Bowen Song via dev
At the moment, when a read error, such as unrecoverable bit error or data corruption, occurs in the SSTable data files, regardless of the disk_failure_policy configuration, manual (or to be precise, external) intervention is required to recover from the error. Commonly, there are two approach