[ 
https://issues.apache.org/jira/browse/KAFKA-15353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Calvin Liu reassigned KAFKA-15353:
----------------------------------

    Assignee: Calvin Liu

> Empty ISR returned from controller after AlterPartition request
> ---------------------------------------------------------------
>
>                 Key: KAFKA-15353
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15353
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 3.5.0
>            Reporter: Luke Chen
>            Assignee: Calvin Liu
>            Priority: Blocker
>             Fix For: 3.6.0, 3.5.2
>
>
> In 
> [KIP-903|https://cwiki.apache.org/confluence/display/KAFKA/KIP-903%3A+Replicas+with+stale+broker+epoch+should+not+be+allowed+to+join+the+ISR],
>  (more specifically this [PR|https://github.com/apache/kafka/pull/13408]), we 
> bumped the AlterPartitionRequest version to 3 to use the `NewIsrWithEpochs` 
> field instead of the `NewIsr` field. When building the request for an older 
> version, we manually downgrade it for backward compatibility 
> [here|https://github.com/apache/kafka/blob/6bd17419b76f8cf8d7e4a11c071494dfaa72cd50/clients/src/main/java/org/apache/kafka/common/requests/AlterPartitionRequest.java#L85-L96]:
>  we extract the ISR info from `NewIsrWithEpochs`, fill in the `NewIsr` 
> field, and then clear the `NewIsrWithEpochs` field.
>  
> The problem is that when the AlterPartitionRequest is sent for the first 
> time and hits a transient error (e.g. NOT_CONTROLLER), we retry. On the 
> retry, we build the AlterPartitionRequest again, but this time from the 
> request data that was already converted above. When we try to extract the 
> ISR from `NewIsrWithEpochs`, which has already been cleared, we get an empty 
> list. So we send out an AlterPartition request with an empty ISR, which 
> impacts Kafka availability.
>  
> From the log, I can see this:
> {code:java}
> [2023-08-16 03:57:55,122] INFO [Partition test_topic-1 broker=3] ISR updated 
> to  (under-min-isr) and version updated to 9 (kafka.cluster.Partition)
> ...
> [2023-08-16 03:57:55,157] ERROR [ReplicaManager broker=3] Error processing 
> append operation on partition test_topic-1 
> (kafka.server.ReplicaManager)org.apache.kafka.common.errors.NotEnoughReplicasException:
>  The size of the current ISR Set() is insufficient to satisfy the min.isr 
> requirement of 2 for partition test_topic-1 {code}
>  
> h4. *Impact:*
> This happens when users upgrade from versions < 3.5.0 to 3.5.0 or later. 
> During the rolling upgrade, some nodes are on v3.5.0 and some are not, so a 
> node on v3.5.0 will build an older version of AlterPartitionRequest. If that 
> request then hits a transient error while being sent, the ISR will be empty 
> and no producers will be able to write data to the partitions.
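
The retry scenario described in the report can be illustrated with a small, self-contained Java sketch. It is not the Kafka source: `BrokerState`, `PartitionData`, and `build` are hypothetical stand-ins for the real request classes, and the `version < 3` check is an assumption based on the statement that version 3 introduced `NewIsrWithEpochs`. The sketch only shows why an in-place downgrade that clears `NewIsrWithEpochs` yields an empty ISR when the same request data is built a second time.

```java
import java.util.ArrayList;
import java.util.List;

public class AlterPartitionDowngradeSketch {

    // Hypothetical stand-in for one entry of the NewIsrWithEpochs field.
    static final class BrokerState {
        final int brokerId;
        final long brokerEpoch;

        BrokerState(int brokerId, long brokerEpoch) {
            this.brokerId = brokerId;
            this.brokerEpoch = brokerEpoch;
        }
    }

    // Hypothetical stand-in for the mutable per-partition request data.
    static final class PartitionData {
        List<Integer> newIsr = new ArrayList<>();
        List<BrokerState> newIsrWithEpochs = new ArrayList<>();
    }

    // Mimics the downgrade described in the report: for versions below 3
    // (assumed threshold), copy broker ids into newIsr and clear
    // newIsrWithEpochs. The data object is mutated in place.
    static PartitionData build(PartitionData data, short version) {
        if (version < 3) {
            List<Integer> isr = new ArrayList<>();
            for (BrokerState state : data.newIsrWithEpochs) {
                isr.add(state.brokerId);
            }
            data.newIsr = isr;
            data.newIsrWithEpochs = new ArrayList<>();
        }
        return data;
    }

    public static void main(String[] args) {
        PartitionData data = new PartitionData();
        data.newIsrWithEpochs.add(new BrokerState(1, 100L));
        data.newIsrWithEpochs.add(new BrokerState(2, 101L));
        data.newIsrWithEpochs.add(new BrokerState(3, 102L));

        // First send: the downgrade extracts the ISR as intended.
        System.out.println("first build ISR: " + build(data, (short) 2).newIsr);

        // Retry after a transient error: the same, already-converted data is
        // built again, newIsrWithEpochs is now empty, so newIsr becomes empty.
        System.out.println("retry build ISR: " + build(data, (short) 2).newIsr);
    }
}
```

Running `main` prints `[1, 2, 3]` for the first build and `[]` for the retry, which corresponds to the empty `Set()` in the log excerpt quoted in the report.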



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
