Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Bowen Song via dev
From an operator's view, I think the most reliable indicator is not the total count of corruption events, but the frequency of the events. Let me try to explain that with some examples: 1. Many corruption events in a short period of time, then nothing after that: the disk is probably still
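
As an illustration of that frequency-based view, here is a minimal sketch (hypothetical class and method names, not Cassandra code) of a sliding-window tracker that reacts to the recent rate of corruption events rather than their lifetime total:

    import java.util.ArrayDeque;
    import java.util.Deque;

    /** Hypothetical sketch: tracks corruption events over a sliding time window. */
    final class CorruptionRateTracker
    {
        private final Deque<Long> eventTimesMillis = new ArrayDeque<>();
        private final long windowMillis;
        private final int threshold;

        CorruptionRateTracker(long windowMillis, int threshold)
        {
            this.windowMillis = windowMillis;
            this.threshold = threshold;
        }

        /** Record one corruption event and report whether the recent rate looks unhealthy. */
        synchronized boolean recordAndCheck(long nowMillis)
        {
            eventTimesMillis.addLast(nowMillis);
            // Drop events that have fallen out of the window; only the recent
            // frequency matters, not the lifetime count.
            while (!eventTimesMillis.isEmpty() && nowMillis - eventTimesMillis.peekFirst() > windowMillis)
                eventTimesMillis.removeFirst();
            return eventTimesMillis.size() >= threshold;
        }
    }

A burst of events followed by silence never trips the threshold again, while a sustained stream of events keeps it tripped, which matches the operator intuition described above.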

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Bowen Song via dev
> When we attempt to rectify any bit-error by streaming data from peers, we implicitly take a lock on token ownership. A user needs to know that it is unsafe to change token ownership in a cluster that is currently in the process of repairing a corruption error on one of its

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Abe Ratnofsky
> there's a point at which a host limping along is better put down and replaced I did a basic literature review, and it looks like load (total program-erase cycles), disk age, and operating temperature all lead to BER increases. We don't need to build a whole model of disk failure; we could

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Josh McKenzie
> I'm not seeing any reasons why CEP-21 would make this more difficult to > implement I think I communicated poorly - I was just trying to point out that there's a point at which a host limping along is better put down and replaced than piecemeal flagging range after range dead and working

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Abe Ratnofsky
I'm not seeing any reasons why CEP-21 would make this more difficult to implement, besides the fact that it hasn't landed yet. There are two major potential pitfalls that CEP-21 would help us avoid: 1. Bit-errors beget further bit-errors, so we ought to be resistant to a high frequency of

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Josh McKenzie
> Personally, I'd like to see the fix for this issue come after CEP-21. It could be feasible to implement a fix before then, that detects bit-errors on the read path and refuses to respond to the coordinator, implicitly having speculative execution handle the retry against another replica
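
A rough sketch of that interim idea (hypothetical types and names; Cassandra's real read path and speculative execution are far more involved): the replica surfaces the corruption as a failure instead of an answer, and the coordinator falls back to another replica:

    /** Hypothetical sketch: a replica refuses to respond when it detects corruption. */
    class CorruptDataException extends Exception {}

    interface Replica
    {
        byte[] read(String key) throws CorruptDataException;
    }

    final class Coordinator
    {
        /** Try replicas in order; a corrupt replica fails fast rather than serving bad data. */
        static byte[] readWithFallback(String key, Iterable<Replica> replicas) throws CorruptDataException
        {
            for (Replica replica : replicas)
            {
                try
                {
                    return replica.read(key);
                }
                catch (CorruptDataException e)
                {
                    // This replica refused to answer; fall through to the next one.
                }
            }
            throw new CorruptDataException(); // no replica could serve an uncorrupted copy
        }
    }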

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Abe Ratnofsky
Thanks for proposing this discussion Bowen. I see a few different issues here: 1. How do we safely handle corruption of a handful of tokens without taking an entire instance offline for re-bootstrap? This includes refusal to serve read requests for the corrupted token(s), and correct repair of

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Bowen Song via dev
Hi Jeremiah, I'm fully aware of that, which is why I said that deleting the affected SSTable files is "less safe". If the "bad blocks" logic is implemented and the node aborts the current read query when hitting a bad block, it should remain safe, as the data in other SSTable files will not
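
A minimal sketch of what such "bad blocks" bookkeeping could look like (hypothetical class, not an actual Cassandra API): before serving data from an SSTable, the reader consults a per-file set of known-bad block offsets and aborts the query rather than silently skipping data:

    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    /** Hypothetical sketch: per-SSTable registry of block offsets known to be corrupt. */
    final class BadBlockRegistry
    {
        private final Map<String, Set<Long>> badBlocksByFile = new ConcurrentHashMap<>();

        /** Remember that a checksum mismatch or bit-error was seen at this offset. */
        void markBad(String sstablePath, long blockOffset)
        {
            badBlocksByFile.computeIfAbsent(sstablePath, p -> ConcurrentHashMap.newKeySet())
                           .add(blockOffset);
        }

        /** Abort the read (surface an error) rather than returning partial data. */
        void checkReadable(String sstablePath, long blockOffset)
        {
            Set<Long> bad = badBlocksByFile.get(sstablePath);
            if (bad != null && bad.contains(blockOffset))
                throw new IllegalStateException("Known-bad block at offset " + blockOffset + " in " + sstablePath);
        }
    }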

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Jeremiah D Jordan
It is actually more complicated than just removing the sstable and running repair. In the face of expired tombstones that might be covering data in other sstables, the only safe way to deal with a bad sstable is to wipe the token range in the bad sstable and rebuild/bootstrap that range (or
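
A small worked example of the hazard being described (a toy model, Java 16+, not Cassandra code): if the corrupt SSTable holds the only remaining copy of a tombstone that has already passed gc_grace_seconds, deleting the file leaves an older live value unshadowed, and repair would then spread that "deleted" value back to every replica:

    import java.util.List;
    import java.util.Map;
    import java.util.Optional;

    /** Toy model of read-time merging: the newest cell wins, tombstones shadow older data. */
    final class TombstoneResurrectionExample
    {
        record Cell(long timestamp, boolean tombstone, String value) {}

        static Optional<String> read(String key, List<Map<String, Cell>> sstables)
        {
            Cell newest = null;
            for (Map<String, Cell> sstable : sstables)
            {
                Cell c = sstable.get(key);
                if (c != null && (newest == null || c.timestamp() > newest.timestamp()))
                    newest = c;
            }
            return (newest == null || newest.tombstone()) ? Optional.empty() : Optional.of(newest.value());
        }

        public static void main(String[] args)
        {
            // The corrupt SSTable holds a tombstone already past gc_grace, so the
            // other replicas have long since compacted their copies away.
            Map<String, Cell> corrupt = Map.of("k", new Cell(100, true, null));
            // A healthy SSTable still holds the older, logically deleted value.
            Map<String, Cell> healthy = Map.of("k", new Cell(50, false, "deleted-long-ago"));

            System.out.println(read("k", List.of(corrupt, healthy))); // Optional.empty: the delete is honoured
            System.out.println(read("k", List.of(healthy)));          // Optional[deleted-long-ago]: the value comes back
        }
    }

Once the corrupt file is gone, neither this node nor any replica still holds the tombstone, so a subsequent repair resurrects the row cluster-wide, which is why wiping and re-bootstrapping the range is the safe option.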

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-08 Thread Jeff Jirsa
On Wed, Mar 8, 2023 at 5:25 AM Bowen Song via dev wrote: > At the moment, when a read error, such as unrecoverable bit error or data corruption, occurs in the SSTable data files, regardless of the disk_failure_policy configuration, manual (or to be precise, external) intervention is

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-08 Thread Bowen Song via dev
> – A repair of the affected range would need to be completed among the replicas without such corruption (including paxos repair). It can be safe without a repair by over-streaming the data from more (or all) available replicas, either within the DC (when LOCAL_* CL is used) or across

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-08 Thread C. Scott Andreas
Realized I’m somewhat mistaken here - the repair of surviving replicas would be necessary for correctness before the node with deleted data files can serve client/internode reads. But the repair of the node with deleted data files prior to being brought back into the cluster is

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-08 Thread C. Scott Andreas
For this to be safe, my understanding is that: – A repair of the affected range would need to be completed among the replicas without such corruption (including paxos repair). – And we'd need a mechanism to execute repair on the affected node without it being available to respond to queries,

[DISCUSS] Enhanced Disk Error Handling

2023-03-08 Thread Bowen Song via dev
At the moment, when a read error, such as unrecoverable bit error or data corruption, occurs in the SSTable data files, regardless of the disk_failure_policy configuration, manual (or to be precise, external) intervention is required to recover from the error. Commonly, there are two approaches
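
For context, disk_failure_policy in cassandra.yaml selects among the values die, stop_paranoid, stop, best_effort and ignore; the sketch below is a loose paraphrase of those behaviours (hypothetical helper methods, not Cassandra's actual implementation), and in every case repairing or replacing the affected data still requires operator intervention, which is the gap this thread is about:

    /** Loose sketch of how the documented disk_failure_policy values map to behaviour. */
    enum DiskFailurePolicySketch
    {
        die,           // shut down transports and kill the JVM so the node can be replaced
        stop_paranoid, // shut down gossip/client transports even for a single corrupt SSTable
        stop,          // shut down gossip/client transports on filesystem errors
        best_effort,   // stop using the failed disk and answer from the remaining SSTables
        ignore;        // ignore the error and let the affected requests fail

        static void onSSTableCorruption(DiskFailurePolicySketch policy)
        {
            switch (policy)
            {
                case die:           System.exit(1); break;
                case stop_paranoid: stopTransports(); break;
                case stop:          /* plain filesystem errors only; single-SSTable corruption falls through */ break;
                case best_effort:   blacklistDisk(); break;
                case ignore:        /* log and carry on */ break;
            }
        }

        private static void stopTransports() { /* hypothetical placeholder */ }
        private static void blacklistDisk()  { /* hypothetical placeholder */ }
    }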