[jira] [Commented] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6

2024-05-17 Thread Johnson Okorie (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847391#comment-17847391
 ] 

Johnson Okorie commented on KAFKA-16692:


Hi [~jolshan], I wanted to express my sincere thanks for your quick response 
and prompt fix for this issue. One of the reasons I love this project! Keep up 
the great work!

> InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not 
> enabled when upgrading from kafka 3.5 to 3.6 
> 
>
> Key: KAFKA-16692
> URL: https://issues.apache.org/jira/browse/KAFKA-16692
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 3.7.0, 3.6.1, 3.8
>Reporter: Johnson Okorie
>Assignee: Justine Olshan
>Priority: Major
>
> We have a Kafka cluster running version 3.5.2 that we are upgrading to 
> 3.6.1. This cluster has many clients with exactly-once semantics enabled, 
> which therefore create transactions. As we replaced brokers with the new 
> binaries, we observed many clients in the cluster experiencing the 
> following error:
> {code:java}
> 2024-05-07T09:08:10.039Z "tid": "" -- [Producer clientId=, 
> transactionalId=] Got error produce response with 
> correlation id 6402937 on topic-partition , retrying 
> (2147483512 attempts left). Error: NETWORK_EXCEPTION. Error Message: The 
> server disconnected before a response was received.{code}
> On inspecting the broker, we saw the following errors on brokers still 
> running Kafka version 3.5.2:
>  
> {code:java}
> message:     
> Closing socket for  because of error
> exception_exception_class:    
> org.apache.kafka.common.errors.InvalidRequestException
> exception_exception_message:    
> Received request api key ADD_PARTITIONS_TO_TXN with version 4 which is not 
> enabled
> exception_stacktrace:    
> org.apache.kafka.common.errors.InvalidRequestException: Received request api 
> key ADD_PARTITIONS_TO_TXN with version 4 which is not enabled
> {code}
> On the new brokers running 3.6.1 we saw the following errors:
>  
> {code:java}
> [AddPartitionsToTxnSenderThread-1055]: AddPartitionsToTxnRequest failed for 
> node 1043 with a network exception.{code}
>  
> I can also see this:
> {code:java}
> [AddPartitionsToTxnManager broker=1055]Cancelled in-flight 
> ADD_PARTITIONS_TO_TXN request with correlation id 21120 due to node 1043 
> being disconnected (elapsed time since creation: 11ms, elapsed time since 
> send: 4ms, request timeout: 3ms){code}
> We started investigating this issue and, digging through the changes in 3.6, 
> came across some changes introduced as part of KAFKA-14402 that we thought 
> might lead to this behaviour. 
> First, we could see that _transaction.partition.verification.enable_ is 
> enabled by default and enables a new code path that culminates in us sending 
> version 4 ADD_PARTITIONS_TO_TXN requests to other brokers. These requests are generated 
> [here|https://github.com/apache/kafka/blob/29f3260a9c07e654a28620aeb93a778622a5233d/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L269].
> From a 
> [discussion|https://lists.apache.org/thread/4895wrd1z92kjb708zck4s1f62xq6r8x] 
> on the mailing list, [~jolshan] pointed out that this scenario shouldn't be 
> possible as the following code paths should prevent version 4 
> ADD_PARTITIONS_TO_TXN requests being sent to other brokers:
> [https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/clients/src/main/java/org/apache/kafka/clients/NodeApiVersions.java#L130]
>  
> [https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L195]
> However, these requests are still sent to other brokers in our environment.
> On further inspection of the code, I am wondering if the following code path 
> could lead to this issue:
> [https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L500]
> In this scenario, we don't have any _NodeApiVersions_ available for the 
> specified nodeId and potentially skip the _latestUsableVersion_ check. Is 
> it possible that, because _discoverBrokerVersions_ is set 
> to _false_ for the network client of the {_}AddPartitionsToTxnManager{_}, it 
> skips fetching {_}NodeApiVersions{_}? I can see that we create the network 
> client here:
> [https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/core/src/main/scala/kafka/server/KafkaServer.scala#L641]
> The _NetworkUtils.buildNetworkClient_ method seems to create a network client 
> that has _discoverBrokerVersions_ set to {_}false{_}. 
> I was hoping I could get some assistance debugging 

[jira] [Comment Edited] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6

2024-05-10 Thread Johnson Okorie (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845380#comment-17845380
 ] 

Johnson Okorie edited comment on KAFKA-16692 at 5/10/24 2:44 PM:
-

Thanks [~jolshan], looking forward to seeing your findings!


was (Author: JIRAUSER305348):
Thanks [~jolshan], looking forward to see you findings!


[jira] [Commented] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6

2024-05-10 Thread Johnson Okorie (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845380#comment-17845380
 ] 

Johnson Okorie commented on KAFKA-16692:


Thanks [~jolshan], looking forward to seeing your findings!




--
This message was sent by Atlassian Jira

[jira] [Updated] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6

2024-05-09 Thread Johnson Okorie (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnson Okorie updated KAFKA-16692:
---
Description: 
We have a Kafka cluster running version 3.5.2 that we are upgrading to 
3.6.1. This cluster has many clients with exactly-once semantics enabled, 
which therefore create transactions. As we replaced brokers with the new binaries, we 
observed many clients in the cluster experiencing the following error:
{code:java}
2024-05-07T09:08:10.039Z "tid": "" -- [Producer clientId=, 
transactionalId=] Got error produce response with correlation 
id 6402937 on topic-partition , retrying (2147483512 attempts 
left). Error: NETWORK_EXCEPTION. Error Message: The server disconnected before 
a response was received.{code}
On inspecting the broker, we saw the following errors on brokers still running 
Kafka version 3.5.2:

 
{code:java}
message:     
Closing socket for  because of error
exception_exception_class:    
org.apache.kafka.common.errors.InvalidRequestException
exception_exception_message:    
Received request api key ADD_PARTITIONS_TO_TXN with version 4 which is not 
enabled
exception_stacktrace:    
org.apache.kafka.common.errors.InvalidRequestException: Received request api 
key ADD_PARTITIONS_TO_TXN with version 4 which is not enabled
{code}
On the new brokers running 3.6.1 we saw the following errors:

 
{code:java}
[AddPartitionsToTxnSenderThread-1055]: AddPartitionsToTxnRequest failed for 
node 1043 with a network exception.{code}
 

I can also see this:
{code:java}
[AddPartitionsToTxnManager broker=1055]Cancelled in-flight 
ADD_PARTITIONS_TO_TXN request with correlation id 21120 due to node 1043 being 
disconnected (elapsed time since creation: 11ms, elapsed time since send: 4ms, 
request timeout: 3ms){code}
We started investigating this issue and, digging through the changes in 3.6, 
came across some changes introduced as part of KAFKA-14402 that we thought 
might lead to this behaviour. 

First, we could see that _transaction.partition.verification.enable_ is enabled 
by default and enables a new code path that culminates in us sending version 4 
ADD_PARTITIONS_TO_TXN requests to other brokers. These requests are generated 
[here|https://github.com/apache/kafka/blob/29f3260a9c07e654a28620aeb93a778622a5233d/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L269].

From a 
[discussion|https://lists.apache.org/thread/4895wrd1z92kjb708zck4s1f62xq6r8x] 
on the mailing list, [~jolshan] pointed out that this scenario shouldn't be 
possible as the following code paths should prevent version 4 
ADD_PARTITIONS_TO_TXN requests being sent to other brokers:

[https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/clients/src/main/java/org/apache/kafka/clients/NodeApiVersions.java#L130]
 
[https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L195]

However, these requests are still sent to other brokers in our environment.

On further inspection of the code, I am wondering if the following code path 
could lead to this issue:

[https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L500]

In this scenario, we don't have any _NodeApiVersions_ available for the 
specified nodeId and potentially skip the _latestUsableVersion_ check. Is it 
possible that, because _discoverBrokerVersions_ is set to 
_false_ for the network client of the {_}AddPartitionsToTxnManager{_}, it skips 
fetching {_}NodeApiVersions{_}? I can see that we create the network client 
here:

[https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/core/src/main/scala/kafka/server/KafkaServer.scala#L641]

The _NetworkUtils.buildNetworkClient_ method seems to create a network client 
that has _discoverBrokerVersions_ set to {_}false{_}. 
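
To make the suspected path concrete, here is a minimal Java sketch of the behaviour described above. It is not Kafka's code: _VersionRange_ and _VersionSelectionSketch_ are hypothetical names, and the ranges 0-4 (sender on 3.6.1) and 0-3 (receiver on 3.5.2) are assumptions chosen only to match the logs. It shows that when the receiver's supported range is known, the highest common version is picked, and when it was never fetched, the check is skipped and the sender's newest version goes out.

```java
import java.util.Optional;

/** Hypothetical stand-in for a supported version range; not a Kafka class. */
final class VersionRange {
    final short min;
    final short max;

    VersionRange(int min, int max) {
        this.min = (short) min;
        this.max = (short) max;
    }
}

final class VersionSelectionSketch {

    /** Highest version both sides support, or empty if the ranges do not overlap. */
    static Optional<Short> latestUsableVersion(VersionRange sender, VersionRange receiver) {
        short lo = (short) Math.max(sender.min, receiver.min);
        short hi = (short) Math.min(sender.max, receiver.max);
        if (lo > hi) {
            return Optional.empty();
        }
        return Optional.of(hi);
    }

    /**
     * Picks the request version. If the receiver's range is unknown (it was
     * never fetched), the usable-version check is skipped and the sender's
     * newest version is used as-is.
     */
    static short chooseVersion(VersionRange sender, Optional<VersionRange> receiver) {
        if (!receiver.isPresent()) {
            return sender.max; // no negotiation happened
        }
        return latestUsableVersion(sender, receiver.get())
            .orElseThrow(() -> new IllegalStateException("no common version"));
    }

    public static void main(String[] args) {
        VersionRange newBroker = new VersionRange(0, 4); // assumed range for the 3.6.1 sender
        VersionRange oldBroker = new VersionRange(0, 3); // assumed range for the 3.5.2 receiver

        // Receiver range known: the request is capped at the common maximum.
        System.out.println(chooseVersion(newBroker, Optional.of(oldBroker))); // prints 3

        // Receiver range unknown: the sender's newest version is used.
        System.out.println(chooseVersion(newBroker, Optional.empty())); // prints 4
    }
}
```

If this is what happens in our cluster, the second case would explain why the 3.5.2 brokers receive version 4 requests even though a usable-version check exists.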

I was hoping I could get some assistance debugging this issue. Happy to provide 
any additional information needed.

 

 

 


[jira] [Updated] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6

2024-05-09 Thread Johnson Okorie (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnson Okorie updated KAFKA-16692:
---
Description: 
We have a Kafka cluster running version 3.5.2 that we are upgrading to 
3.6.1. This cluster has many clients with exactly-once semantics enabled, 
which therefore create transactions. As we replaced brokers with the new binaries, we 
observed many clients in the cluster experiencing the following error:
{code:java}
2024-05-07T09:08:10.039Z "tid": "" -- [Producer clientId=, 
transactionalId=] Got error produce response with correlation 
id 6402937 on topic-partition , retrying (2147483512 attempts 
left). Error: NETWORK_EXCEPTION. Error Message: The server disconnected before 
a response was received.{code}
On inspecting the broker, we saw the following errors on brokers still running 
Kafka version 3.5.2:

 
{code:java}
message:     
Closing socket for  because of error
exception_exception_class:    
org.apache.kafka.common.errors.InvalidRequestException
exception_exception_message:    
Received request api key ADD_PARTITIONS_TO_TXN with version 4 which is not 
enabled
exception_stacktrace:    
org.apache.kafka.common.errors.InvalidRequestException: Received request api 
key ADD_PARTITIONS_TO_TXN with version 4 which is not enabled
{code}
On the new brokers running 3.6.1 we saw the following errors:

 
{code:java}
[AddPartitionsToTxnSenderThread-1055]: AddPartitionsToTxnRequest failed for 
node 1043 with a network exception.{code}
 

I can also see this:
{code:java}
[AddPartitionsToTxnManager broker=1055]Cancelled in-flight 
ADD_PARTITIONS_TO_TXN request with correlation id 21120 due to node 1043 being 
disconnected (elapsed time since creation: 11ms, elapsed time since send: 4ms, 
request timeout: 3ms){code}
We started investigating this issue and, digging through the changes in 3.6, 
came across some changes introduced as part of KAFKA-14402 that we thought 
might lead to this behaviour. 

First, we could see that _transaction.partition.verification.enable_ is enabled 
by default and enables a new code path that culminates in us sending version 4 
ADD_PARTITIONS_TO_TXN requests to other brokers. These requests are generated 
[here|https://github.com/apache/kafka/blob/29f3260a9c07e654a28620aeb93a778622a5233d/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L269].

From a 
[discussion|https://lists.apache.org/thread/4895wrd1z92kjb708zck4s1f62xq6r8x] 
on the mailing list, [~jolshan] pointed out that this scenario shouldn't be 
possible as the following code paths should prevent version 4 
ADD_PARTITIONS_TO_TXN requests being sent to other brokers:

[https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/clients/src/main/java/org/apache/kafka/clients/NodeApiVersions.java#L130]
 
[https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L195]

However, these requests are still sent to other brokers in our environment.

On further inspection of the code, I am wondering if the following code path 
could lead to this issue:

[https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L500]

In this scenario, we don't have any _NodeApiVersions_ available for the 
specified nodeId and potentially skip the expected _latestUsableVersion_ 
check. Is it possible that, because 
_discoverBrokerVersions_ is set to _false_ for the network client of the 
{_}AddPartitionsToTxnManager{_}, it skips fetching {_}NodeApiVersions{_}? I can 
see that we create the network client here:

[https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/core/src/main/scala/kafka/server/KafkaServer.scala#L641]

The _NetworkUtils.buildNetworkClient_ method seems to create a network client 
that has _discoverBrokerVersions_ set to {_}false{_}. 

I was hoping I could get some assistance debugging this issue. Happy to provide 
any additional information needed.

 

 

 


[jira] [Updated] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6

2024-05-09 Thread Johnson Okorie (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnson Okorie updated KAFKA-16692:
---
Description: 
We have a Kafka cluster running version 3.5.2 that we are upgrading to 
3.6.1. This cluster has many clients with exactly-once semantics enabled, 
which therefore create transactions. As we replaced brokers with the new binaries, we 
observed many clients in the cluster experiencing the following error:
{code:java}
2024-05-07T09:08:10.039Z "tid": "" -- [Producer clientId=, 
transactionalId=] Got error produce response with correlation 
id 6402937 on topic-partition , retrying (2147483512 attempts 
left). Error: NETWORK_EXCEPTION. Error Message: The server disconnected before 
a response was received.{code}
On inspecting the broker, we saw the following errors on brokers still running 
Kafka version 3.5.2:

 
{code:java}
message:     
Closing socket for  because of error
exception_exception_class:    
org.apache.kafka.common.errors.InvalidRequestException
exception_exception_message:    
Received request api key ADD_PARTITIONS_TO_TXN with version 4 which is not 
enabled
exception_stacktrace:    
org.apache.kafka.common.errors.InvalidRequestException: Received request api 
key ADD_PARTITIONS_TO_TXN with version 4 which is not enabled
{code}
On the new brokers running 3.6.1 we saw the following errors:

 
{code:java}
[AddPartitionsToTxnSenderThread-1055]: AddPartitionsToTxnRequest failed for 
node 1043 with a network exception.{code}
 

I can also see this:
{code:java}
[AddPartitionsToTxnManager broker=1055]Cancelled in-flight 
ADD_PARTITIONS_TO_TXN request with correlation id 21120 due to node 1043 being 
disconnected (elapsed time since creation: 11ms, elapsed time since send: 4ms, 
request timeout: 3ms){code}
While investigating this issue and digging through the changes in 3.6, we 
came across some changes introduced as part of KAFKA-14402 that we thought 
might lead to this behaviour. 

First, we could see that _transaction.partition.verification.enable_ is enabled 
by default and enables a new code path that culminates in us sending version 4 
ADD_PARTITIONS_TO_TXN requests to other brokers. These requests are generated 
[here|https://github.com/apache/kafka/blob/29f3260a9c07e654a28620aeb93a778622a5233d/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L269].

From a 
[discussion|https://lists.apache.org/thread/4895wrd1z92kjb708zck4s1f62xq6r8x] 
on the mailing list, [~jolshan] pointed out that this scenario shouldn't be 
possible as the following code paths should prevent version 4 
ADD_PARTITIONS_TO_TXN requests being sent to other brokers:

[https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/clients/src/main/java/org/apache/kafka/clients/NodeApiVersions.java#L130]
 
[https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L195]

However, it seems that these requests are still being sent to other brokers in 
our environment.
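
To make the expected guard concrete, here is a minimal sketch of the kind of 
version negotiation the linked _latestUsableVersion_ code is meant to perform. 
It is not the actual Kafka implementation: the class name, method signature and 
exception type are hypothetical stand-ins. It also assumes that a 3.5.2 broker 
advertises ADD_PARTITIONS_TO_TXN only up to version 3, which is what the 
"version 4 which is not enabled" error above suggests.

```java
public class LatestUsableVersionSketch {

    // Hypothetical stand-in for the negotiation: intersect the range the
    // sender can speak with the range the receiving broker advertised, and
    // pick the highest version in the overlap.
    static short latestUsableVersion(short clientMin, short clientMax,
                                     short brokerMin, short brokerMax) {
        short lowest = (short) Math.max(clientMin, brokerMin);
        short highest = (short) Math.min(clientMax, brokerMax);
        if (lowest > highest) {
            // Kafka raises its own exception type here; IllegalStateException
            // is used only to keep the sketch self-contained.
            throw new IllegalStateException("No common version between sender and broker");
        }
        return highest;
    }

    public static void main(String[] args) {
        // Assumption: the 3.6 sender supports v0..v4, the 3.5.2 broker advertises v0..v3.
        short negotiated = latestUsableVersion((short) 0, (short) 4, (short) 0, (short) 3);
        System.out.println("negotiated version: " + negotiated); // prints "negotiated version: 3"
    }
}
```

If this negotiation ran against the advertised range of an old broker, a 
version 4 request should never leave the new broker, which is why the observed 
errors are surprising.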

On further inspection of the code, I am wondering if the following code path 
could lead to this issue:

[https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L500]

In this scenario, we don't have any _NodeApiVersions_ available for the 
specified nodeId, so the _latestUsableVersion_ check is potentially skipped 
rather than applied as expected. I am wondering whether, because 
_discoverBrokerVersions_ is set to false for the network client of the 
AddPartitionsToTxnManager, it skips fetching ApiVersions. We create the 
network client here:

[https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/core/src/main/scala/kafka/server/KafkaServer.scala#L641]

This _NetworkUtils.buildNetworkClient_ seems to create a network client that 
has _discoverBrokerVersions_ set to false. 
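
The suspected path can be sketched as follows. This is a simplified model under 
stated assumptions, not the real _NetworkClient_ code: the map stands in for the 
per-node ApiVersions cache, and the class and method names are hypothetical. 
The assumption being illustrated is that, when no version information is cached 
for a node, the sender falls back to the latest version it knows about.

```java
import java.util.HashMap;
import java.util.Map;

public class VersionFallbackSketch {

    // knownMaxVersions: hypothetical cache of the highest version each node
    // advertised. It stays empty if ApiVersions are never fetched.
    static short chooseVersion(Map<Integer, Short> knownMaxVersions,
                               int nodeId, short senderLatest) {
        Short cached = knownMaxVersions.get(nodeId);
        if (cached == null) {
            // No version info for this node: nothing to negotiate against,
            // so the sender's latest version is used unchecked.
            return senderLatest;
        }
        short brokerMax = cached;
        return (short) Math.min(senderLatest, brokerMax);
    }

    public static void main(String[] args) {
        Map<Integer, Short> cache = new HashMap<>();
        short senderLatest = 4;

        // Cache never populated (the suspected discoverBrokerVersions=false case).
        System.out.println(chooseVersion(cache, 1043, senderLatest)); // prints 4

        // Cache populated with the old broker's advertised maximum (assumed to be 3).
        cache.put(1043, (short) 3);
        System.out.println(chooseVersion(cache, 1043, senderLatest)); // prints 3
    }
}
```

In this model, the only difference between the failing and the expected 
behaviour is whether the cache was ever filled for node 1043.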

I was hoping I could get some assistance debugging this issue. Happy to provide 
any additional information needed.

 

 

 


[jira] [Updated] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6

2024-05-09 Thread Johnson Okorie (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnson Okorie updated KAFKA-16692:
---
Summary: InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 
which is not enabled when upgrading from kafka 3.5 to 3.6   (was: 
InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not 
enabled when )

> InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not 
> enabled when upgrading from kafka 3.5 to 3.6 
> 
>
> Key: KAFKA-16692
> URL: https://issues.apache.org/jira/browse/KAFKA-16692
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 3.6.1
>Reporter: Johnson Okorie
>Assignee: Justine Olshan
>Priority: Major
>



--
This message was sent by Atlassian Jira

[jira] [Updated] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when

2024-05-08 Thread Johnson Okorie (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnson Okorie updated KAFKA-16692:
---
Description: 
We have a Kafka cluster running version 3.5.2 that we are upgrading to 
3.6.1. Many clients of this cluster have exactly-once semantics enabled and 
therefore create transactions. As we replaced brokers with the new binaries, we 
observed many clients in the cluster experiencing the following error:
{code:java}
2024-05-07T09:08:10.039Z "tid": "" -- [Producer clientId=, 
transactionalId=] Got error produce response with correlation 
id 6402937 on topic-partition , retrying (2147483512 attempts 
left). Error: NETWORK_EXCEPTION. Error Message: The server disconnected before 
a response was received.{code}
On inspecting the broker, we saw the following errors on brokers still running 
Kafka version 3.5.2:

 
{code:java}
message:     
Closing socket for  because of error
exception_exception_class:    
org.apache.kafka.common.errors.InvalidRequestException
exception_exception_message:    
Received request api key ADD_PARTITIONS_TO_TXN with version 4 which is not 
enabled
exception_stacktrace:    
org.apache.kafka.common.errors.InvalidRequestException: Received request api 
key ADD_PARTITIONS_TO_TXN with version 4 which is not enabled
{code}
On the new brokers running 3.6.1 we saw the following errors:

 
{code:java}
[AddPartitionsToTxnSenderThread-1055]: AddPartitionsToTxnRequest failed for 
node 1043 with a network exception.{code}
 

I can also see this:
{code:java}
[AddPartitionsToTxnManager broker=1055]Cancelled in-flight 
ADD_PARTITIONS_TO_TXN request with correlation id 21120 due to node 1043 being 
disconnected (elapsed time since creation: 11ms, elapsed time since send: 4ms, 
request timeout: 3ms){code}
While investigating this issue and digging through the changes in 3.6, we 
came across some changes introduced as part of KAFKA-14402 that we thought 
might lead to this behaviour. 

First, we could see that _transaction.partition.verification.enable_ is enabled 
by default and enables a new code path that culminates in us sending version 4 
ADD_PARTITIONS_TO_TXN requests to other brokers. These requests are generated 
[here|#L269]].

From a 
[discussion|https://lists.apache.org/thread/4895wrd1z92kjb708zck4s1f62xq6r8x] 
on the mailing list, [~jolshan] pointed out that this scenario shouldn't be 
possible as the following code paths should prevent version 4 
ADD_PARTITIONS_TO_TXN requests being sent to other brokers:

[https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/clients/src/main/java/org/apache/kafka/clients/NodeApiVersions.java#L130]
 
[https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L195]

However, it seems that these requests are still being sent to other brokers in 
our environment.

On further inspection of the code, I am wondering if the following code path 
could lead to this issue:

[https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L500]

In this scenario, we don't have any _NodeApiVersions_ available for the 
specified nodeId, so the _latestUsableVersion_ check is potentially skipped 
rather than applied as expected. I am wondering whether, because 
_discoverBrokerVersions_ is set to false for the network client of the 
AddPartitionsToTxnManager, it skips fetching ApiVersions. We create the 
network client here:

[https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/core/src/main/scala/kafka/server/KafkaServer.scala#L641]

This _NetworkUtils.buildNetworkClient_ seems to create a network client that 
has _discoverBrokerVersions_ set to false. 

I was hoping I could get some assistance debugging this issue. Happy to provide 
any additional information needed.

 

 

 


[jira] [Updated] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when

2024-05-08 Thread Johnson Okorie (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnson Okorie updated KAFKA-16692:
---
Affects Version/s: 3.6.1
   (was: 3.6.2)

> InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not 
> enabled when 
> 
>
> Key: KAFKA-16692
> URL: https://issues.apache.org/jira/browse/KAFKA-16692
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 3.6.1
>Reporter: Johnson Okorie
>Priority: Major
>





[jira] [Updated] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when

2024-05-08 Thread Johnson Okorie (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnson Okorie updated KAFKA-16692:
---
Description: 
We have a Kafka cluster running version 3.5.2 that we are upgrading to 
3.6.2. Many clients of this cluster have exactly-once semantics enabled and 
therefore create transactions. As we replaced brokers with the new binaries, we 
observed many clients in the cluster experiencing the following error:
{code:java}
2024-05-07T09:08:10.039Z "tid": "" -- [Producer clientId=, 
transactionalId=] Got error produce response with correlation 
id 6402937 on topic-partition , retrying (2147483512 attempts 
left). Error: NETWORK_EXCEPTION. Error Message: The server disconnected before 
a response was received.{code}
On inspecting the broker, we saw the following errors on brokers still running 
Kafka version 3.5.2:

 
{code:java}
message:     
Closing socket for  because of error
exception_exception_class:    
org.apache.kafka.common.errors.InvalidRequestException
exception_exception_message:    
Received request api key ADD_PARTITIONS_TO_TXN with version 4 which is not 
enabled
exception_stacktrace:    
org.apache.kafka.common.errors.InvalidRequestException: Received request api 
key ADD_PARTITIONS_TO_TXN with version 4 which is not enabled
{code}
On the new brokers running 3.6.2 we saw the following errors:

 
{code:java}
[AddPartitionsToTxnSenderThread-1055]: AddPartitionsToTxnRequest failed for 
node 1043 with a network exception.{code}
 

I can also see this:
{code:java}
[AddPartitionsToTxnManager broker=1055]Cancelled in-flight 
ADD_PARTITIONS_TO_TXN request with correlation id 21120 due to node 1043 being 
disconnected (elapsed time since creation: 11ms, elapsed time since send: 4ms, 
request timeout: 3ms){code}
While investigating this issue and digging through the changes in 3.6, we 
came across some changes introduced as part of KAFKA-14402 that we thought 
might lead to this behaviour. 

First, we could see that _transaction.partition.verification.enable_ is enabled 
by default and enables a new code path that culminates in us sending version 4 
ADD_PARTITIONS_TO_TXN requests to other brokers. These requests are generated 
[here|#L269]].

From a 
[discussion|https://lists.apache.org/thread/4895wrd1z92kjb708zck4s1f62xq6r8x] 
on the mailing list, [~jolshan] pointed out that this scenario shouldn't be 
possible as the following code paths should prevent version 4 
ADD_PARTITIONS_TO_TXN requests being sent to other brokers:

[https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/clients/src/main/java/org/apache/kafka/clients/NodeApiVersions.java#L130]
 
[https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L195]

However, it seems that these requests are still being sent to other brokers in 
our environment.

On further inspection of the code, I am wondering if the following code path 
could lead to this issue:

[https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L500]

In this scenario, we don't have any _NodeApiVersions_ available for the 
specified nodeId, so the _latestUsableVersion_ check is potentially skipped 
rather than applied as expected. I am wondering whether, because 
_discoverBrokerVersions_ is set to false for the network client of the 
AddPartitionsToTxnManager, it skips fetching ApiVersions. We create the 
network client here:

[https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/core/src/main/scala/kafka/server/KafkaServer.scala#L641]

This _NetworkUtils.buildNetworkClient_ seems to create a network client that 
has _discoverBrokerVersions_ set to false. 

I was hoping I could get some assistance debugging this issue. Happy to provide 
any additional information needed.

 

 

 


[jira] [Created] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when

2024-05-08 Thread Johnson Okorie (Jira)
Johnson Okorie created KAFKA-16692:
--

 Summary: InvalidRequestException: ADD_PARTITIONS_TO_TXN with 
version 4 which is not enabled when 
 Key: KAFKA-16692
 URL: https://issues.apache.org/jira/browse/KAFKA-16692
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 3.6.2
Reporter: Johnson Okorie


We have a Kafka cluster running version 3.5.2 that we are upgrading to 
3.6.2. Many clients of this cluster have exactly-once semantics enabled and 
therefore create transactions. As we replaced brokers with the new binaries, we 
observed many clients in the cluster experiencing the following error:


{code:java}
2024-05-07T09:08:10.039Z "tid": "" -- [Producer clientId=, 
transactionalId=] Got error produce response with correlation 
id 6402937 on topic-partition , retrying (2147483512 attempts 
left). Error: NETWORK_EXCEPTION. Error Message: The server disconnected before 
a response was received.{code}
On inspecting the broker, we saw the following errors on brokers still running 
Kafka version 3.5.2:

 
{code:java}
message:     
Closing socket for  because of error
exception_exception_class:    
org.apache.kafka.common.errors.InvalidRequestException
exception_exception_message:    
Received request api key ADD_PARTITIONS_TO_TXN with version 4 which is not 
enabled
exception_stacktrace:    
org.apache.kafka.common.errors.InvalidRequestException: Received request api 
key ADD_PARTITIONS_TO_TXN with version 4 which is not enabled
{code}
On the new brokers running 3.6.2 we saw the following errors:

 
{code:java}
[AddPartitionsToTxnSenderThread-1055]: AddPartitionsToTxnRequest failed for 
node 1043 with a network exception.{code}
 

I can also see this:
{code:java}
[AddPartitionsToTxnManager broker=1055]Cancelled in-flight 
ADD_PARTITIONS_TO_TXN request with correlation id 21120 due to node 1043 being 
disconnected (elapsed time since creation: 11ms, elapsed time since send: 4ms, 
request timeout: 3ms){code}

While investigating this issue and digging through the changes in 3.6, we 
came across some changes introduced as part of 
[KAFKA-14402|https://issues.apache.org/jira/browse/KAFKA-14402] that we thought 
might lead to this behaviour. 

First, we could see that _transaction.partition.verification.enable_ is enabled 
by default and enables a new code path that culminates in us sending version 4 
ADD_PARTITIONS_TO_TXN requests to other brokers. These requests are generated 
[here|https://github.com/apache/kafka/blob/cb35ddc5ca233d5cca6f51c1c41b952a7e9fe1a0/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L269].

From a 
[discussion|https://lists.apache.org/thread/4895wrd1z92kjb708zck4s1f62xq6r8x] 
on the mailing list, [~jolshan] pointed out that this scenario shouldn't be 
possible as the following code paths should prevent version 4 
ADD_PARTITIONS_TO_TXN requests being sent to other brokers:

[https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/clients/src/main/java/org/apache/kafka/clients/NodeApiVersions.java#L130]
 
[https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L195]

However, it seems that these requests are still being sent to other brokers in 
our environment.

On further inspection of the code, I am wondering if the following code path 
could lead to this issue:

[https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L500]

In this scenario, we don't have any _NodeApiVersions_ available for the 
specified nodeId, so the _latestUsableVersion_ check is potentially skipped 
rather than applied as expected. I am wondering whether, because 
_discoverBrokerVersions_ is set to false for the network client of the 
AddPartitionsToTxnManager, it skips fetching ApiVersions. We create the 
network client here:

[https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/core/src/main/scala/kafka/server/KafkaServer.scala#L641]

This _NetworkUtils.buildNetworkClient_ seems to create a network client that 
has _discoverBrokerVersions_ set to false. 

I was hoping I could get some assistance debugging this issue.