[jira] [Updated] (KAFKA-14139) Replaced disk can lead to loss of committed data even with non-empty ISR

2023-01-30 Thread Alexandre Dupriez (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexandre Dupriez updated KAFKA-14139:
--
Description: 
We have been thinking about disk failure cases recently. Suppose that a disk 
has failed and the user replaces it, restarting the broker from an empty state. 
The concern is whether this can lead to the unnecessary loss of committed data.

For normal topic partitions, removal from the ISR during controlled shutdown 
buys us some protection. After the replica is restarted, it must prove its 
state to the leader before it can be added back to the ISR. And it cannot 
become a leader until it does so.

An obvious exception to this is when the replica is the last member in the ISR. 
In this case, the disk failure itself has compromised the committed data, so 
some amount of loss must be expected.

We have been considering other scenarios in which the loss of one disk can lead 
to data loss even when there are replicas remaining which have all of the 
committed entries. One such scenario is this:

Suppose we have a partition with two replicas: A and B. Initially A is the 
leader and it is the only member of the ISR.
 # Broker B catches up to A, so A attempts to send an AlterPartition request to 
the controller to add B into the ISR.
 # Before the AlterPartition request is received, replica B has a hard failure.
 # The current controller successfully fences broker B. It takes no action on 
this partition since B is already out of the ISR.
 # Before the controller receives the AlterPartition request to add B, it also 
fails.
 # While the new controller is initializing, suppose that replica B finishes 
startup, but the disk has been replaced (all of the previous state has been 
lost).
 # The new controller sees the registration from broker B first.
 # Finally, the AlterPartition from A arrives which adds B back into the ISR 
even though it has an empty log.
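
The race above can be condensed into a minimal Python sketch. This is an illustrative simplification, not Kafka's actual controller code: the single `Controller` object stands in for fencing and ISR state that really lives in the metadata log and survives controller failover. The point it demonstrates is that without any epoch check, the stale AlterPartition is accepted after B re-registers with an empty log.

```python
# Hypothetical model of the race (names are illustrative, not Kafka's code).
# The controller tracks only the ISR and which brokers are fenced; the
# AlterPartition request carries no broker epoch, so a request prepared
# before B failed is still accepted after B re-registers with a new disk.

class Controller:
    def __init__(self, isr):
        self.isr = set(isr)        # in-sync replicas, initially {"A"}
        self.fenced = set()        # brokers currently fenced

    def fence(self, broker):
        self.fenced.add(broker)
        self.isr.discard(broker)   # no-op here: B was never in the ISR

    def register(self, broker):
        self.fenced.discard(broker)  # broker re-registered (empty disk)

    def alter_partition(self, add):
        # Without a broker-epoch check, the stale request still succeeds.
        if add not in self.fenced:
            self.isr.add(add)
            return True
        return False

logs = {"A": ["m1", "m2"], "B": ["m1", "m2"]}  # step 1: B caught up to A

ctrl = Controller(isr={"A"})
# Step 2: B has a hard failure before AlterPartition(add="B") arrives.
# Step 3: the old controller fences B (no ISR change: B is not in the ISR).
ctrl.fence("B")
# Steps 4-5: controller fails over; B restarts with a replaced, empty disk.
logs["B"] = []
# Step 6: the new controller sees B's registration first.
ctrl.register("B")
# Step 7: the stale AlterPartition finally arrives and is accepted.
accepted = ctrl.alter_partition(add="B")

print(accepted, sorted(ctrl.isr), logs["B"])  # True ['A', 'B'] []
```

B ends up in the ISR with an empty log, so it can be elected leader and the committed records on A can be lost.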

(Credit for coming up with this scenario goes to [~junrao].)

I tested this in KRaft and confirmed that this sequence is possible (even if 
perhaps unlikely). There are a few ways we could have potentially detected the 
issue. First, perhaps the leader should have bumped the leader epoch on all 
partitions when B was fenced. Then the inflight AlterPartition would be doomed 
no matter when it arrived.
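
A sketch of that first option (hypothetical names, not the real controller logic): if fencing a broker bumped the leader epoch of every partition it replicates, the stale AlterPartition would carry the old epoch and fail validation whenever it arrived.

```python
# Illustrative sketch: fencing bumps the leader epoch, dooming stale requests.

class Controller:
    def __init__(self):
        self.leader_epoch = 5
        self.isr = {"A"}

    def fence(self, broker):
        # Proposed change: bump the epoch on all partitions the broker hosts.
        self.leader_epoch += 1

    def alter_partition(self, add, leader_epoch):
        if leader_epoch != self.leader_epoch:
            return "STALE_LEADER_EPOCH"   # stale request is rejected
        self.isr.add(add)
        return "OK"

ctrl = Controller()
stale_epoch = ctrl.leader_epoch   # A built its request at epoch 5
ctrl.fence("B")                   # fencing bumps the epoch to 6
print(ctrl.alter_partition(add="B", leader_epoch=stale_epoch))
# STALE_LEADER_EPOCH
```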

Alternatively, we could have relied on the broker epoch to distinguish the dead 
broker's state from that of the restarted broker. This could be done by 
including the broker epoch in both the `Fetch` request and in `AlterPartition`.
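
The broker-epoch alternative can be sketched the same way (again with illustrative names, not the actual protocol schema): A reports the broker epoch it last observed from B's Fetch requests, and the controller rejects the request if B has since re-registered under a newer epoch.

```python
# Illustrative sketch: AlterPartition carries the broker epoch the leader
# last observed via Fetch; a restarted broker gets a new epoch, so the
# stale request no longer matches and is rejected.

class Controller:
    def __init__(self):
        self.broker_epochs = {"A": 10, "B": 20}
        self.isr = {"A"}

    def register(self, broker):
        self.broker_epochs[broker] += 1   # restart assigns a new broker epoch

    def alter_partition(self, add, observed_broker_epoch):
        if observed_broker_epoch != self.broker_epochs[add]:
            return "STALE_BROKER_EPOCH"   # B restarted since A last saw it
        self.isr.add(add)
        return "OK"

ctrl = Controller()
observed = ctrl.broker_epochs["B"]   # epoch A saw in B's Fetch requests
ctrl.register("B")                   # B comes back with a replaced disk
print(ctrl.alter_partition(add="B", observed_broker_epoch=observed))
# STALE_BROKER_EPOCH
```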

Finally, perhaps even normal Kafka replication should use a unique 
identifier for each disk so that we can reliably detect when it has changed. 
For example, something like what was proposed for the metadata quorum here: 
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Voter+Changes].
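
A sketch of that last idea (hypothetical, loosely modeled on the KIP-853 direction rather than any shipped API): if a replica is identified by its broker id plus a per-disk UUID, a replaced disk yields a different identity, so a stale request naming the old identity simply refers to a replica that no longer exists.

```python
# Illustrative sketch: replicas identified by (broker id, disk uuid), so a
# replaced disk produces a new identity and the stale request is rejected.
import uuid

class Controller:
    def __init__(self, isr):
        self.isr = set(isr)  # members are (broker_id, disk_uuid) pairs

    def alter_partition(self, add, known_replicas):
        if add not in known_replicas:
            return "UNKNOWN_REPLICA"   # identity from before the disk swap
        self.isr.add(add)
        return "OK"

old_b = ("B", uuid.uuid4())  # identity A observed before the failure
new_b = ("B", uuid.uuid4())  # same broker id, new disk => new identity
ctrl = Controller(isr={("A", uuid.uuid4())})
print(ctrl.alter_partition(add=old_b, known_replicas={new_b}))
# UNKNOWN_REPLICA
```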


[jira] [Updated] (KAFKA-14139) Replaced disk can lead to loss of committed data even with non-empty ISR

2022-12-14 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-14139:
---
Fix Version/s: 3.5.0
   (was: 3.4.0)

> Replaced disk can lead to loss of committed data even with non-empty ISR
> 
>
> Key: KAFKA-14139
> URL: https://issues.apache.org/jira/browse/KAFKA-14139
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Priority: Major
> Fix For: 3.5.0
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14139) Replaced disk can lead to loss of committed data even with non-empty ISR

2022-08-05 Thread Jose Armando Garcia Sancio (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jose Armando Garcia Sancio updated KAFKA-14139:
---
Fix Version/s: 3.4.0






[jira] [Updated] (KAFKA-14139) Replaced disk can lead to loss of committed data even with non-empty ISR

2022-08-03 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-14139:
