[jira] [Commented] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6

2024-05-15 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846763#comment-17846763
 ] 

Justine Olshan commented on KAFKA-16692:


Hey [~akaltsikis] yeah, those are the release notes for 3.6. We can probably 
edit the kafka-site repo to get the change to show up in real time, but 
updating via the kafka repo requires waiting for the next release for the site 
to update.

> InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not 
> enabled when upgrading from kafka 3.5 to 3.6 
> 
>
> Key: KAFKA-16692
> URL: https://issues.apache.org/jira/browse/KAFKA-16692
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 3.7.0, 3.6.1, 3.8
>Reporter: Johnson Okorie
>Assignee: Justine Olshan
>Priority: Major
>
> We have a kafka cluster running on version 3.5.2 that we are upgrading to 
> 3.6.1. This cluster has a lot of clients with exactly-once semantics enabled 
> and hence creating transactions. As we replaced brokers with the new 
> binaries, we observed lots of clients in the cluster experiencing the 
> following error:
> {code:java}
> 2024-05-07T09:08:10.039Z "tid": "" -- [Producer clientId=, 
> transactionalId=] Got error produce response with 
> correlation id 6402937 on topic-partition , retrying 
> (2147483512 attempts left). Error: NETWORK_EXCEPTION. Error Message: The 
> server disconnected before a response was received.{code}
> On inspecting the broker, we saw the following errors on brokers still 
> running Kafka version 3.5.2:
>  
> {code:java}
> message:     
> Closing socket for  because of error
> exception_exception_class:    
> org.apache.kafka.common.errors.InvalidRequestException
> exception_exception_message:    
> Received request api key ADD_PARTITIONS_TO_TXN with version 4 which is not 
> enabled
> exception_stacktrace:    
> org.apache.kafka.common.errors.InvalidRequestException: Received request api 
> key ADD_PARTITIONS_TO_TXN with version 4 which is not enabled
> {code}
> On the new brokers running 3.6.1 we saw the following errors:
>  
> {code:java}
> [AddPartitionsToTxnSenderThread-1055]: AddPartitionsToTxnRequest failed for 
> node 1043 with a network exception.{code}
>  
> I can also see this :
> {code:java}
> [AddPartitionsToTxnManager broker=1055]Cancelled in-flight 
> ADD_PARTITIONS_TO_TXN request with correlation id 21120 due to node 1043 
> being disconnected (elapsed time since creation: 11ms, elapsed time since 
> send: 4ms, request timeout: 3ms){code}
> We started investigating this issue and digging through the changes in 3.6, 
> we came across some changes introduced as part of KAFKA-14402 that we thought 
> might lead to this behaviour. 
> First we could see that _transaction.partition.verification.enable_ is 
> enabled by default and enables a new code path that culminates in sending 
> version 4 ADD_PARTITIONS_TO_TXN requests to other brokers, which are generated 
> [here|https://github.com/apache/kafka/blob/29f3260a9c07e654a28620aeb93a778622a5233d/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L269].
> From a 
> [discussion|https://lists.apache.org/thread/4895wrd1z92kjb708zck4s1f62xq6r8x] 
> on the mailing list, [~jolshan] pointed out that this scenario shouldn't be 
> possible as the following code paths should prevent version 4 
> ADD_PARTITIONS_TO_TXN requests being sent to other brokers:
> [https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/clients/src/main/java/org/apache/kafka/clients/NodeApiVersions.java#L130]
>  
> [https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L195]
> However, these requests are still sent to other brokers in our environment.
> On further inspection of the code, I am wondering if the following code path 
> could lead to this issue:
> [https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L500]
> In this scenario, we don't have any _NodeApiVersions_ available for the 
> specified nodeId, potentially skipping the _latestUsableVersion_ check. I 
> am wondering if it is possible that because _discoverBrokerVersions_ is set 
> to _false_ for the network client of the {_}AddPartitionsToTxnManager{_}, it 
> skips fetching {_}NodeApiVersions{_}? I can see that we create the network 
> client here:
> [https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/core/src/main/scala/kafka/server/KafkaServer.scala#L641]
> The _NetworkUtils.buildNetworkClient_ method seems to create a network client 
> that has _discoverBrokerVersions_ set to {_}false{_}. 
> I was hoping I could get some assistance debugging this issue. Happy to 
> provide any additional information needed.
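To make the hypothesis above concrete, here is a paraphrased sketch (not a verbatim copy of the Kafka code) of the version-selection step in _NetworkClient.doSend_ that the description points at; the helper name and parameters are illustrative only:

{code:java}
import org.apache.kafka.clients.ApiVersions;
import org.apache.kafka.clients.NodeApiVersions;
import org.apache.kafka.common.protocol.ApiKeys;
import org.apache.kafka.common.requests.AbstractRequest;

public class VersionSelectionSketch {
    // If no ApiVersions were ever fetched for the target node (discoverBrokerVersions is
    // false for the AddPartitionsToTxnManager client), the request is built at the newest
    // version this 3.6 broker knows, instead of being clamped to what the 3.5 broker supports.
    static short chooseRequestVersion(ApiVersions apiVersions, String nodeId,
                                      ApiKeys apiKey, AbstractRequest.Builder<?> builder) {
        NodeApiVersions versionInfo = apiVersions.get(nodeId);
        if (versionInfo == null) {
            return builder.latestAllowedVersion();
        }
        // Normal path: clamp to a version the target node actually advertises.
        return versionInfo.latestUsableVersion(apiKey,
                builder.oldestAllowedVersion(), builder.latestAllowedVersion());
    }
}
{code}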

[jira] [Updated] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6

2024-05-15 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan updated KAFKA-16692:
---
Affects Version/s: 3.8




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6

2024-05-15 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan updated KAFKA-16692:
---
Affects Version/s: 3.7.0




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6

2024-05-15 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846742#comment-17846742
 ] 

Justine Olshan commented on KAFKA-16692:


[~akaltsikis] Hmmm...I've included information in release notes about bugs like 
this, but I'm not sure if there is a place to update between releases. The good 
news though is that the PR is almost ready. Just running a final batch of tests.


[jira] [Commented] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6

2024-05-15 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846591#comment-17846591
 ] 

Justine Olshan commented on KAFKA-16692:


The delay in the fix comes from how hard it is to reproduce the issue in a 
system test: while I can get the error to appear, it is hard to consistently 
get a test to fail because of it. I assure you I'm making progress and hope to 
have a fix by the end of the week :) 


[jira] [Commented] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6

2024-05-15 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846590#comment-17846590
 ] 

Justine Olshan commented on KAFKA-16692:


Hi [~akaltsikis]. Yes the bug is due to the verification requests during the 
upgrade. If you disable the verification, you won't hit the bug.

However, the feature was designed to allow upgrades, and there seems to be a 
small bug in the logic to gate the requests during the upgrade. I'm working on 
fixing that.

In the meantime, disabling the feature during the upgrade is a workaround.
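For anyone applying the workaround, a minimal sketch of disabling the flag cluster-wide with the AdminClient (the bootstrap address is a placeholder; the same property can also be set in server.properties and rolled):

{code:java}
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class DisableVerificationDuringUpgrade {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092"); // placeholder address
        try (Admin admin = Admin.create(props)) {
            // Empty broker name targets the cluster-wide default (like --entity-default).
            ConfigResource allBrokers = new ConfigResource(ConfigResource.Type.BROKER, "");
            AlterConfigOp disableVerification = new AlterConfigOp(
                    new ConfigEntry("transaction.partition.verification.enable", "false"),
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(allBrokers, List.of(disableVerification)))
                 .all().get();
        }
    }
}
{code}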


[jira] [Commented] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6

2024-05-13 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846032#comment-17846032
 ] 

Justine Olshan commented on KAFKA-16692:


Hey [~j-okorie] I'm still looking at this. I have a way to reproduce the upgrade 
in the test, but the test still passes, likely because there is enough time to 
write the messages. I will tweak it to see if I can get it to fail and then 
confirm it passes with the fix.


[jira] [Resolved] (KAFKA-16513) Allow WriteTxnMarkers API with Alter Cluster Permission

2024-05-10 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan resolved KAFKA-16513.

Resolution: Fixed

> Allow WriteTxnMarkers API with Alter Cluster Permission
> ---
>
> Key: KAFKA-16513
> URL: https://issues.apache.org/jira/browse/KAFKA-16513
> Project: Kafka
>  Issue Type: Improvement
>  Components: admin
>Reporter: Nikhil Ramakrishnan
>Assignee: Siddharth Yagnik
>Priority: Minor
>  Labels: KIP-1037
> Fix For: 3.8.0
>
>
> We should allow the WriteTxnMarkers API with Alter Cluster Permission because it 
> can be invoked externally by a Kafka AdminClient. Such usage is more aligned 
> with the Alter permission on the Cluster resource, which includes other 
> administrative actions invoked from the Kafka AdminClient.
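As a rough illustration of what this change enables, a sketch of granting Alter on the Cluster resource to the principal that calls the AdminClient (principal, host, and bootstrap address are hypothetical):

{code:java}
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class AllowAlterCluster {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092"); // placeholder address
        try (Admin admin = Admin.create(props)) {
            // "kafka-cluster" is the literal name of the Cluster resource.
            AclBinding allowAlter = new AclBinding(
                    new ResourcePattern(ResourceType.CLUSTER, "kafka-cluster", PatternType.LITERAL),
                    new AccessControlEntry("User:txn-admin", "*",
                            AclOperation.ALTER, AclPermissionType.ALLOW));
            admin.createAcls(List.of(allowAlter)).all().get();
        }
    }
}
{code}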



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16699) Have Streams treat InvalidPidMappingException like a ProducerFencedException

2024-05-10 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845497#comment-17845497
 ] 

Justine Olshan commented on KAFKA-16699:


Yay yay! I'm happy this is getting fixed :) 

> Have Streams treat InvalidPidMappingException like a ProducerFencedException
> 
>
> Key: KAFKA-16699
> URL: https://issues.apache.org/jira/browse/KAFKA-16699
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Walker Carlson
>Assignee: Walker Carlson
>Priority: Major
>
> KStreams is able to handle the ProducerFenced (among other errors) cleanly. 
> It does this by closing the task dirty and triggering a rebalance amongst the 
> worker threads to rejoin the group. The producer is also recreated. Due to 
> how streams works (writing to and reading from various topics), the 
> application is able to figure out the last thing the fenced producer 
> completed and continue from there.
> KStreams EOS V2 also trusts that any open transaction (including those whose 
> producer is fenced) will be aborted by the server. This is a key factor in 
> how it is able to operate. In EOS V1, the new InitProducerId fences and 
> aborts the previous transaction. In either case, we are able to reason about 
> the last valid state from the fenced producer and how to proceed.
> h2. InvalidPidMappingException ≈ ProducerFenced
> I argue that InvalidPidMappingException can be handled in the same way. Let 
> me explain why.
> There are two cases we see this error:
>  # {{txnManager.getTransactionState(transactionalId).flatMap {  case None => 
> Left(Errors.INVALID_PRODUCER_ID_MAPPING)}}
>  # {{if (txnMetadata.producerId != producerId)  
> Left(Errors.INVALID_PRODUCER_ID_MAPPING)}}
> h3. Case 1
> We are missing a value in the transactional state map for the transactional 
> ID. Under normal operations, this is only possible when the transactional ID 
> expires via the mechanism described above after 
> {{transactional.id.expiration.ms}} of inactivity. In this case, there is no 
> state that needs to be reconciled. It is safe to just rebalance and rejoin 
> the group with a new producer. We probably don’t even need to close the task 
> dirty, but it doesn’t hurt to do so.
> h3. Case 2
> This is a bit more interesting. It says that we have transactional state, but 
> the producer ID in the request does not match the producer ID associated with 
> the transactional ID on the broker. How can this happen?
> It is possible that a new producer instance B with the same transactional ID 
> was created after the transactional state expired for instance A. Given there 
> is no state on the server when B joins, it will get a totally new producer 
> ID. If the original producer A comes back, it will have state for this 
> transactional ID but the wrong producer ID.
> In this case, the old producer ID is fenced, it’s just the normal epoch-based 
> fencing logic doesn’t apply. We can treat it the same however.
> h2. Summary
> As described in the cases above, any time we encounter the InvalidPidMapping 
> during normal operation, the previous producer was either finished with its 
> operations or was fenced. Thus, it is safe to close the task dirty and rebalance + 
> rejoin the group just as we do with the ProducerFenced exception.
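A minimal sketch of the proposed handling (not the actual Streams code; the recovery callback is a hypothetical stand-in for the task lifecycle):

{code:java}
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.common.errors.InvalidPidMappingException;
import org.apache.kafka.common.errors.ProducerFencedException;

public class PidMappingAsFencedSketch {
    // Treat InvalidPidMappingException exactly like ProducerFencedException:
    // close the task dirty, rejoin the group, and recreate the producer.
    static void commitTransactionOrRecover(Producer<byte[], byte[]> producer,
                                           Runnable closeDirtyRejoinAndRecreate) {
        try {
            producer.commitTransaction();
        } catch (ProducerFencedException | InvalidPidMappingException e) {
            // In both cases the old producer's work is either complete or fenced,
            // so the same recovery path is safe.
            closeDirtyRejoinAndRecreate.run();
        }
    }
}
{code}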



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6

2024-05-09 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845004#comment-17845004
 ] 

Justine Olshan commented on KAFKA-16692:


I'm working on a test to recreate the issue, and then seeing if the proposed 
fix helps.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (KAFKA-16264) Expose `producer.id.expiration.check.interval.ms` as dynamic broker configuration

2024-05-08 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844721#comment-17844721
 ] 

Justine Olshan commented on KAFKA-16264:


Thanks. I will take a look. 
It just occurred to me that such a change may require a small KIP. 
[https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals]
Let me double check and get back to you. 

> Expose `producer.id.expiration.check.interval.ms` as dynamic broker 
> configuration
> -
>
> Key: KAFKA-16264
> URL: https://issues.apache.org/jira/browse/KAFKA-16264
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jorge Esteban Quilcate Otoya
>Assignee: Jorge Esteban Quilcate Otoya
>Priority: Major
>
> Dealing with a scenario where too many producer ids lead to issues (e.g. high 
> CPU utilization, see KAFKA-16229) puts operators in need of flushing producer 
> ids more promptly than usual.
> Currently, only the expiration timeout `producer.id.expiration.ms` is exposed 
> as a dynamic config. This is helpful (e.g. by reducing the timeout, fewer 
> producers would eventually be kept in memory), but not enough if the 
> evaluation frequency is not short enough to flush producer ids before they 
> become an issue. Only by tuning both can the issue be worked around.
>  
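As a sketch of the operator workflow this enables: `producer.id.expiration.ms` can already be tightened dynamically, and the proposal is to expose the check interval the same way (bootstrap address and values are placeholders):

{code:java}
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class TightenProducerIdExpiration {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092"); // placeholder address
        try (Admin admin = Admin.create(props)) {
            ConfigResource allBrokers = new ConfigResource(ConfigResource.Type.BROKER, "");
            // Already dynamic today: shorten the expiration timeout cluster-wide.
            AlterConfigOp shortenExpiration = new AlterConfigOp(
                    new ConfigEntry("producer.id.expiration.ms", "300000"),
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(allBrokers, List.of(shortenExpiration)))
                 .all().get();
            // Once this ticket lands, "producer.id.expiration.check.interval.ms" could be
            // tightened with an identical AlterConfigOp so the expiration sweep runs often enough.
        }
    }
}
{code}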



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when

2024-05-08 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan reassigned KAFKA-16692:
--

Assignee: Justine Olshan




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when

2024-05-08 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844708#comment-17844708
 ] 

Justine Olshan commented on KAFKA-16692:


Hi [~j-okorie]! Thanks for filing the ticket and the additional debugging. I 
will look at this today. 




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16386) NETWORK_EXCEPTIONs from transaction verification are not translated

2024-04-25 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan updated KAFKA-16386:
---
Fix Version/s: 3.7.1

> NETWORK_EXCEPTIONs from transaction verification are not translated
> ---
>
> Key: KAFKA-16386
> URL: https://issues.apache.org/jira/browse/KAFKA-16386
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.6.0, 3.7.0
>Reporter: Sean Quah
>Priority: Minor
> Fix For: 3.8.0, 3.7.1
>
>
> KAFKA-14402 
> ([KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense])
>  adds verification with the transaction coordinator on Produce and 
> TxnOffsetCommit paths as a defense against hanging transactions. For 
> compatibility with older clients, retriable errors from the verification step 
> are translated to ones already expected and handled by existing clients. When 
> verification was added, we forgot to translate {{NETWORK_EXCEPTION}} errors.
> [~dajac] noticed this manifesting as a test failure when 
> tests/kafkatest/tests/core/transactions_test.py was run with an older client 
> (prior to the fix for KAFKA-16122):
> {quote}
> {{NETWORK_EXCEPTION}} is indeed returned as a partition error. The 
> {{TransactionManager.TxnOffsetCommitHandler}} considers it as a fatal error 
> so it transitions to the fatal state.
> It seems that there are two cases where the server could return it: (1) When 
> the verification request times out or its connection is cut; or (2) in 
> {{AddPartitionsToTxnManager.addTxnData}} where we say that we use it because 
> we want a retriable error.
> {quote}
> The first case was triggered as part of the test. The second case happens 
> when there is already a verification request ({{AddPartitionsToTxn}}) in 
> flight with the same epoch and we want clients to try again when we're not 
> busy.
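Purely as an illustration of the missing translation (hypothetical helper; the actual target error chosen by the fix may differ):

{code:java}
import org.apache.kafka.common.protocol.Errors;

public class VerificationErrorTranslationSketch {
    // Retriable errors from the verification step must be mapped to errors that
    // pre-KIP-890 clients already know how to retry; NETWORK_EXCEPTION is the
    // case this ticket says was missed.
    static Errors translateForOldClients(Errors verificationError, Errors retriableFallback) {
        return verificationError == Errors.NETWORK_EXCEPTION ? retriableFallback : verificationError;
    }
}
{code}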



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16570) FenceProducers API returns "unexpected error" when successful

2024-04-18 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838796#comment-17838796
 ] 

Justine Olshan commented on KAFKA-16570:


Hmmm – retries here are a bit strange since the command was successful; we will 
just return this error until we can allocate the new producer ID. I suppose 
this "guarantees" the markers were written, but it is not consistent with how 
EndTxn usually works.

See how the error is returned on success of the EndTxn API call? 
[https://github.com/apache/kafka/blob/53ff1a5a589eb0e30a302724fcf5d0c72c823682/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L175]
 

> FenceProducers API returns "unexpected error" when successful
> -
>
> Key: KAFKA-16570
> URL: https://issues.apache.org/jira/browse/KAFKA-16570
> Project: Kafka
>  Issue Type: Bug
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
>
> When we want to fence a producer using the admin client, we send an 
> InitProducerId request.
> There is logic in that API to fence (and abort) any ongoing transactions and 
> that is what the API relies on to fence the producer. However, this handling 
> also returns CONCURRENT_TRANSACTIONS. In normal usage, this is good because 
> we want to actually get a new producer ID and want to retry until the ID 
> is supplied or we time out.  
> [https://github.com/apache/kafka/blob/5193eb93237ba9093ae444d73a1eaa2d6abcc9c1/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L170]
>  
> [https://github.com/apache/kafka/blob/a3dcbd4e28a35f79f75ec1bf316ef0b39c0df164/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L1322]
>  
> In the case of the fence producer, we don't retry; instead we have no handling 
> for concurrent transactions, and we log a message about an unexpected error.
> [https://github.com/apache/kafka/blob/a3dcbd4e28a35f79f75ec1bf316ef0b39c0df164/clients/src/main/java/org/apache/kafka/clients/admin/internals/FenceProducersHandler.java#L112]
>  
> This is not unexpected though and the operation was successful. We should 
> just swallow this error and treat this as a successful run of the command. 
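As a rough illustration of the proposed handling (this is not FenceProducersHandler itself; the enum and method are invented for the sketch), the fencing path could treat CONCURRENT_TRANSACTIONS from InitProducerId as success, since the ongoing transaction has already been fenced/aborted and we never needed the new producer ID:

{code:java}
// Hedged sketch only -- illustrative of the proposed "swallow the error" behavior.
public class FenceResponseHandlingSketch {
    enum Error { NONE, CONCURRENT_TRANSACTIONS, TRANSACTIONAL_ID_AUTHORIZATION_FAILED }

    static boolean fenceSucceeded(Error initProducerIdError) {
        switch (initProducerIdError) {
            case NONE:
                return true;
            case CONCURRENT_TRANSACTIONS:
                // The old producer is already fenced and its transaction aborted;
                // we don't need the new producer ID, so treat this as success
                // rather than logging an "unexpected error".
                return true;
            default:
                return false; // genuinely failed
        }
    }
}
{code}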



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16386) NETWORK_EXCEPTIONs from transaction verification are not translated

2024-04-18 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan updated KAFKA-16386:
---
Affects Version/s: 3.7.0

> NETWORK_EXCEPTIONs from transaction verification are not translated
> ---
>
> Key: KAFKA-16386
> URL: https://issues.apache.org/jira/browse/KAFKA-16386
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.6.0, 3.7.0
>Reporter: Sean Quah
>Priority: Minor
> Fix For: 3.8.0
>
>
> KAFKA-14402 
> ([KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense])
>  adds verification with the transaction coordinator on Produce and 
> TxnOffsetCommit paths as a defense against hanging transactions. For 
> compatibility with older clients, retriable errors from the verification step 
> are translated to ones already expected and handled by existing clients. When 
> verification was added, we forgot to translate {{NETWORK_EXCEPTION}}s.
> [~dajac] noticed this manifesting as a test failure when 
> tests/kafkatest/tests/core/transactions_test.py was run with an older client 
> (prior to the fix for KAFKA-16122):
> {quote}
> {{NETWORK_EXCEPTION}} is indeed returned as a partition error. The 
> {{TransactionManager.TxnOffsetCommitHandler}} considers it as a fatal error 
> so it transitions to the fatal state.
> It seems that there are two cases where the server could return it: (1) When 
> the verification request times out or its connection is cut; or (2) in 
> {{AddPartitionsToTxnManager.addTxnData}} where we say that we use it because 
> we want a retriable error.
> {quote}
> The first case was triggered as part of the test. The second case happens 
> when there is already a verification request ({{AddPartitionsToTxn}}) in 
> flight with the same epoch and we want clients to try again when we're not 
> busy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16386) NETWORK_EXCEPTIONs from transaction verification are not translated

2024-04-18 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838768#comment-17838768
 ] 

Justine Olshan commented on KAFKA-16386:


Note: For 3.6, this is only returned on the produce path, which treats the 
exception as retriable. The TxnOffsetCommit path is the one that experiences 
the fatal error, and that path only exists in 3.7.

> NETWORK_EXCEPTIONs from transaction verification are not translated
> ---
>
> Key: KAFKA-16386
> URL: https://issues.apache.org/jira/browse/KAFKA-16386
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.6.0, 3.7.0
>Reporter: Sean Quah
>Priority: Minor
> Fix For: 3.8.0
>
>
> KAFKA-14402 
> ([KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense])
>  adds verification with the transaction coordinator on Produce and 
> TxnOffsetCommit paths as a defense against hanging transactions. For 
> compatibility with older clients, retriable errors from the verification step 
> are translated to ones already expected and handled by existing clients. When 
> verification was added, we forgot to translate {{NETWORK_EXCEPTION}}s.
> [~dajac] noticed this manifesting as a test failure when 
> tests/kafkatest/tests/core/transactions_test.py was run with an older client 
> (prior to the fix for KAFKA-16122):
> {quote}
> {{NETWORK_EXCEPTION}} is indeed returned as a partition error. The 
> {{TransactionManager.TxnOffsetCommitHandler}} considers it as a fatal error 
> so it transitions to the fatal state.
> It seems that there are two cases where the server could return it: (1) When 
> the verification request times out or its connection is cut; or (2) in 
> {{AddPartitionsToTxnManager.addTxnData}} where we say that we use it because 
> we want a retriable error.
> {quote}
> The first case was triggered as part of the test. The second case happens 
> when there is already a verification request ({{AddPartitionsToTxn}}) in 
> flight with the same epoch and we want clients to try again when we're not 
> busy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16570) FenceProducers API returns "unexpected error" when successful

2024-04-17 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan updated KAFKA-16570:
---
Description: 
When we want to fence a producer using the admin client, we send an 
InitProducerId request.

There is logic in that API to fence (and abort) any ongoing transactions and 
that is what the API relies on to fence the producer. However, this handling 
also returns CONCURRENT_TRANSACTIONS. In normal usage, this is good because we 
want to actually get a new producer ID and want to retry until the ID is 
supplied or we time out.  
[https://github.com/apache/kafka/blob/5193eb93237ba9093ae444d73a1eaa2d6abcc9c1/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L170]
 
[https://github.com/apache/kafka/blob/a3dcbd4e28a35f79f75ec1bf316ef0b39c0df164/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L1322]
 

In the case of the fence producer, we don't retry; instead we have no handling 
for concurrent transactions, and we log a message about an unexpected error.
[https://github.com/apache/kafka/blob/a3dcbd4e28a35f79f75ec1bf316ef0b39c0df164/clients/src/main/java/org/apache/kafka/clients/admin/internals/FenceProducersHandler.java#L112]
 

This is not unexpected though and the operation was successful. We should just 
swallow this error and treat this as a successful run of the command. 

  was:
When we want to fence a producer using the admin client, we send an 
InitProducerId request.

There is logic in that API to fence (and abort) any ongoing transactions and 
that is what the API relies on to fence the producer. However, this handling 
also returns CONCURRENT_TRANSACTIONS. In normal usage, this is good because we 
want to actually get a new producer ID and want to retry until the ID is 
supplied or we time out.  
[https://github.com/apache/kafka/blob/5193eb93237ba9093ae444d73a1eaa2d6abcc9c1/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L170]
 
[https://github.com/confluentinc/ce-kafka/blob/b626db8bd94fe971adef3551518761a7be7de454/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L1322]
 

In the case of the fence producer, we don't retry; instead we have no handling 
for concurrent transactions, and we log a message about an unexpected error.
[https://github.com/confluentinc/ce-kafka/blob/b626db8bd94fe971adef3551518761a7be7de454/clients/src/main/java/org/apache/kafka/clients/admin/internals/FenceProducersHandler.java#L112]
 

This is not unexpected though and the operation was successful. We should just 
swallow this error and treat this as a successful run of the command. 


> FenceProducers API returns "unexpected error" when successful
> -
>
> Key: KAFKA-16570
> URL: https://issues.apache.org/jira/browse/KAFKA-16570
> Project: Kafka
>  Issue Type: Bug
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
>
> When we want to fence a producer using the admin client, we send an 
> InitProducerId request.
> There is logic in that API to fence (and abort) any ongoing transactions and 
> that is what the API relies on to fence the producer. However, this handling 
> also returns CONCURRENT_TRANSACTIONS. In normal usage, this is good because 
> we want to actually get a new producer ID and want to retry until the ID 
> is supplied or we time out.  
> [https://github.com/apache/kafka/blob/5193eb93237ba9093ae444d73a1eaa2d6abcc9c1/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L170]
>  
> [https://github.com/apache/kafka/blob/a3dcbd4e28a35f79f75ec1bf316ef0b39c0df164/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L1322]
>  
> In the case of the fence producer, we don't retry; instead we have no handling 
> for concurrent transactions, and we log a message about an unexpected error.
> [https://github.com/apache/kafka/blob/a3dcbd4e28a35f79f75ec1bf316ef0b39c0df164/clients/src/main/java/org/apache/kafka/clients/admin/internals/FenceProducersHandler.java#L112]
>  
> This is not unexpected though and the operation was successful. We should 
> just swallow this error and treat this as a successful run of the command. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-16571) reassign_partitions_test.bounce_brokers should wait for messages to be sent to every partition

2024-04-16 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan reassigned KAFKA-16571:
--

Assignee: Justine Olshan

> reassign_partitions_test.bounce_brokers should wait for messages to be sent 
> to every partition
> --
>
> Key: KAFKA-16571
> URL: https://issues.apache.org/jira/browse/KAFKA-16571
> Project: Kafka
>  Issue Type: Bug
>Reporter: David Mao
>Assignee: Justine Olshan
>Priority: Major
>
> This particular system test tries to bounce brokers while produce is ongoing. 
> The test also has rf=3 and min.isr=3 configured, so if any brokers are 
> bounced before records are produced to every partition, it is possible to run 
> into OutOfOrderSequence exceptions similar to what is described in 
> https://issues.apache.org/jira/browse/KAFKA-14359
> When running the produce_consume_validate for the reassign_partitions_test, 
> instead of waiting for 5 acked messages, we should wait for messages to be 
> acked on the full set of partitions.
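For illustration, here is a hedged Java sketch of the waiting condition described above (the system test itself is a ducktape/Python test; the topic name, bootstrap address, and structure below are placeholders): keep producing until at least one record has been acked on every partition before any broker bounce starts.

{code:java}
// Hedged sketch only -- not the actual system test.
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class WaitForAllPartitionsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        String topic = "reassign-topic"; // placeholder topic name
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            int partitions = producer.partitionsFor(topic).size();
            Set<Integer> acked = ConcurrentHashMap.newKeySet();
            while (acked.size() < partitions) {
                producer.send(new ProducerRecord<>(topic, UUID.randomUUID().toString(), "payload"),
                        (metadata, exception) -> {
                            if (exception == null) acked.add(metadata.partition());
                        });
                producer.flush(); // make sure callbacks have fired before re-checking
            }
            // Only after this point is it safe to start bouncing brokers.
        }
    }
}
{code}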



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16570) FenceProducers API returns "unexpected error" when successful

2024-04-16 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan updated KAFKA-16570:
---
Description: 
When we want to fence a producer using the admin client, we send an 
InitProducerId request.

There is logic in that API to fence (and abort) any ongoing transactions and 
that is what the API relies on to fence the producer. However, this handling 
also returns CONCURRENT_TRANSACTIONS. In normal usage, this is good because we 
want to actually get a new producer ID and want to retry until the ID is 
supplied or we time out.  
[https://github.com/apache/kafka/blob/5193eb93237ba9093ae444d73a1eaa2d6abcc9c1/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L170]
 
[https://github.com/confluentinc/ce-kafka/blob/b626db8bd94fe971adef3551518761a7be7de454/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L1322]
 

In the case of the fence producer, we don't retry; instead we have no handling 
for concurrent transactions, and we log a message about an unexpected error.
[https://github.com/confluentinc/ce-kafka/blob/b626db8bd94fe971adef3551518761a7be7de454/clients/src/main/java/org/apache/kafka/clients/admin/internals/FenceProducersHandler.java#L112]
 

This is not unexpected though and the operation was successful. We should just 
swallow this error and treat this as a successful run of the command. 

  was:
When we want to fence a producer using the admin client, we send an 
InitProducerId request.

There is logic in that API to fence (and abort) any ongoing transactions and 
that is what the API relies on to fence the producer. However, this handling 
also returns CONCURRENT_TRANSACTIONS. In normal usage, this is good because we 
want to actually get a new producer ID and want to retry until the ID is 
supplied or we time out.  
[https://github.com/apache/kafka/blob/5193eb93237ba9093ae444d73a1eaa2d6abcc9c1/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L170]
 



In the case of the fence producer, we don't retry; instead we have no handling 
for concurrent transactions, and we log a message about an unexpected error.
[https://github.com/confluentinc/ce-kafka/blob/b626db8bd94fe971adef3551518761a7be7de454/clients/src/main/java/org/apache/kafka/clients/admin/internals/FenceProducersHandler.java#L112]
 

This is not unexpected though and the operation was successful. We should just 
swallow this error and treat this as a successful run of the command. 


> FenceProducers API returns "unexpected error" when successful
> -
>
> Key: KAFKA-16570
> URL: https://issues.apache.org/jira/browse/KAFKA-16570
> Project: Kafka
>  Issue Type: Bug
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
>
> When we want to fence a producer using the admin client, we send an 
> InitProducerId request.
> There is logic in that API to fence (and abort) any ongoing transactions and 
> that is what the API relies on to fence the producer. However, this handling 
> also returns CONCURRENT_TRANSACTIONS. In normal usage, this is good because 
> we want to actually get a new producer ID and want to retry until the ID 
> is supplied or we time out.  
> [https://github.com/apache/kafka/blob/5193eb93237ba9093ae444d73a1eaa2d6abcc9c1/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L170]
>  
> [https://github.com/confluentinc/ce-kafka/blob/b626db8bd94fe971adef3551518761a7be7de454/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L1322]
>  
> In the case of the fence producer, we don't retry; instead we have no handling 
> for concurrent transactions, and we log a message about an unexpected error.
> [https://github.com/confluentinc/ce-kafka/blob/b626db8bd94fe971adef3551518761a7be7de454/clients/src/main/java/org/apache/kafka/clients/admin/internals/FenceProducersHandler.java#L112]
>  
> This is not unexpected though and the operation was successful. We should 
> just swallow this error and treat this as a successful run of the command. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16570) FenceProducers API returns "unexpected error" when successful

2024-04-16 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837902#comment-17837902
 ] 

Justine Olshan commented on KAFKA-16570:


cc: [~ChrisEgerton] [~showuon] [~tombentley] as author and reviewers of the 
original change.

> FenceProducers API returns "unexpected error" when successful
> -
>
> Key: KAFKA-16570
> URL: https://issues.apache.org/jira/browse/KAFKA-16570
> Project: Kafka
>  Issue Type: Bug
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
>
> When we want to fence a producer using the admin client, we send an 
> InitProducerId request.
> There is logic in that API to fence (and abort) any ongoing transactions and 
> that is what the API relies on to fence the producer. However, this handling 
> also returns CONCURRENT_TRANSACTIONS. In normal usage, this is good because 
> we want to actually get a new producer ID and want to retry until the ID 
> is supplied or we time out.  
> [https://github.com/apache/kafka/blob/5193eb93237ba9093ae444d73a1eaa2d6abcc9c1/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L170]
>  
> [https://github.com/confluentinc/ce-kafka/blob/b626db8bd94fe971adef3551518761a7be7de454/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L1322]
>  
> In the case of the fence producer, we don't retry; instead we have no handling 
> for concurrent transactions, and we log a message about an unexpected error.
> [https://github.com/confluentinc/ce-kafka/blob/b626db8bd94fe971adef3551518761a7be7de454/clients/src/main/java/org/apache/kafka/clients/admin/internals/FenceProducersHandler.java#L112]
>  
> This is not unexpected though and the operation was successful. We should 
> just swallow this error and treat this as a successful run of the command. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16570) FenceProducers API returns "unexpected error" when successful

2024-04-16 Thread Justine Olshan (Jira)
Justine Olshan created KAFKA-16570:
--

 Summary: FenceProducers API returns "unexpected error" when 
successful
 Key: KAFKA-16570
 URL: https://issues.apache.org/jira/browse/KAFKA-16570
 Project: Kafka
  Issue Type: Bug
Reporter: Justine Olshan
Assignee: Justine Olshan


When we want to fence a producer using the admin client, we send an 
InitProducerId request.

There is logic in that API to fence (and abort) any ongoing transactions and 
that is what the API relies on to fence the producer. However, this handling 
also returns CONCURRENT_TRANSACTIONS. In normal usage, this is good because we 
want to actually get a new producer ID and want to retry until the ID is 
supplied or we time out.  
[https://github.com/apache/kafka/blob/5193eb93237ba9093ae444d73a1eaa2d6abcc9c1/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L170]
 



In the case of the fence producer, we don't retry; instead we have no handling 
for concurrent transactions, and we log a message about an unexpected error.
[https://github.com/confluentinc/ce-kafka/blob/b626db8bd94fe971adef3551518761a7be7de454/clients/src/main/java/org/apache/kafka/clients/admin/internals/FenceProducersHandler.java#L112]
 

This is not unexpected though and the operation was successful. We should just 
swallow this error and treat this as a successful run of the command. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16451) testDeltaFollower tests failing in ReplicaManager

2024-03-29 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan resolved KAFKA-16451.

Resolution: Duplicate

> testDeltaFollower tests failing in ReplicaManager
> -
>
> Key: KAFKA-16451
> URL: https://issues.apache.org/jira/browse/KAFKA-16451
> Project: Kafka
>  Issue Type: Bug
>Reporter: Justine Olshan
>Priority: Major
>
> Many ReplicaManagerTests with the prefix testDeltaFollower seem to be 
> failing. A few other ReplicaManager tests as well. See existing failures in 
> [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2765/tests]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16451) testDeltaFollower tests failing in ReplicaManager

2024-03-29 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832330#comment-17832330
 ] 

Justine Olshan commented on KAFKA-16451:


closing as it is a duplicate of 
https://issues.apache.org/jira/browse/KAFKA-16447

> testDeltaFollower tests failing in ReplicaManager
> -
>
> Key: KAFKA-16451
> URL: https://issues.apache.org/jira/browse/KAFKA-16451
> Project: Kafka
>  Issue Type: Bug
>Reporter: Justine Olshan
>Priority: Major
>
> Many ReplicaManagerTests with the prefix testDeltaFollower seem to be 
> failing. A few other ReplicaManager tests as well. See existing failures in 
> [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2765/tests]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16451) testDeltaFollower tests failing in ReplicaManager

2024-03-29 Thread Justine Olshan (Jira)
Justine Olshan created KAFKA-16451:
--

 Summary: testDeltaFollower tests failing in ReplicaManager
 Key: KAFKA-16451
 URL: https://issues.apache.org/jira/browse/KAFKA-16451
 Project: Kafka
  Issue Type: Bug
Reporter: Justine Olshan


Many ReplicaManagerTests with the prefix testDeltaFollower seem to be failing. 
A few other ReplicaManager tests as well. See existing failures in 
[https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2765/tests]
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14089) Flaky ExactlyOnceSourceIntegrationTest.testSeparateOffsetsTopic

2024-03-25 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830728#comment-17830728
 ] 

Justine Olshan commented on KAFKA-14089:


I've seen this again as well. 
https://ge.apache.org/scans/tests?search.names=Git%20branch=P28D=kafka=America%2FLos_Angeles=trunk=org.apache.kafka.connect.integration.ExactlyOnceSourceIntegrationTest=testSeparateOffsetsTopic

> Flaky ExactlyOnceSourceIntegrationTest.testSeparateOffsetsTopic
> ---
>
> Key: KAFKA-14089
> URL: https://issues.apache.org/jira/browse/KAFKA-14089
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Mickael Maison
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: failure.txt, 
> org.apache.kafka.connect.integration.ExactlyOnceSourceIntegrationTest.testSeparateOffsetsTopic.test.stdout
>
>
> It looks like the sequence got broken around "65535, 65537, 65536, 65539, 
> 65538, 65541, 65540, 65543"



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15914) Flaky test - testAlterSinkConnectorOffsetsOverriddenConsumerGroupId - OffsetsApiIntegrationTest

2024-03-25 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830727#comment-17830727
 ] 

Justine Olshan commented on KAFKA-15914:


I'm seeing this as well. 
https://ge.apache.org/scans/tests?search.names=Git%20branch=P28D=kafka=America%2FLos_Angeles=trunk=org.apache.kafka.connect.integration.OffsetsApiIntegrationTest=testAlterSinkConnectorOffsetsOverriddenConsumerGroupId

> Flaky test - testAlterSinkConnectorOffsetsOverriddenConsumerGroupId - 
> OffsetsApiIntegrationTest
> ---
>
> Key: KAFKA-15914
> URL: https://issues.apache.org/jira/browse/KAFKA-15914
> Project: Kafka
>  Issue Type: Bug
>  Components: connect
>Reporter: Lucas Brutschy
>Priority: Major
>  Labels: flaky-test
>
> Test intermittently gives the following result:
> {code}
> org.opentest4j.AssertionFailedError: Condition not met within timeout 3. 
> Sink connector consumer group offsets should catch up to the topic end 
> offsets ==> expected:  but was: 
>   at 
> org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>   at 
> org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>   at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
>   at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
>   at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:210)
>   at 
> org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:331)
>   at 
> org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:379)
>   at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:328)
>   at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:312)
>   at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:302)
>   at 
> org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.verifyExpectedSinkConnectorOffsets(OffsetsApiIntegrationTest.java:917)
>   at 
> org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.alterAndVerifySinkConnectorOffsets(OffsetsApiIntegrationTest.java:396)
>   at 
> org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.testAlterSinkConnectorOffsetsOverriddenConsumerGroupId(OffsetsApiIntegrationTest.java:297)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-14359) Idempotent Producer continues to retry on OutOfOrderSequence error when first batch fails

2024-03-18 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan reassigned KAFKA-14359:
--

Assignee: Justine Olshan

> Idempotent Producer continues to retry on OutOfOrderSequence error when first 
> batch fails
> -
>
> Key: KAFKA-14359
> URL: https://issues.apache.org/jira/browse/KAFKA-14359
> Project: Kafka
>  Issue Type: Task
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
>
> When the idempotent producer does not have any state, it can fall into a state 
> where the producer keeps retrying an out-of-order sequence. Consider the 
> following scenario where an idempotent producer has retries and delivery 
> timeout set to int max (a configuration used in Streams).
> 1. A producer sends out several batches (up to 5) with the first one starting 
> at sequence 0.
> 2. The first batch with sequence 0 fails due to a transient error (e.g., 
> NOT_LEADER_OR_FOLLOWER or a timeout error)
> 3. The second batch, say with sequence 200, comes in. Since there is no 
> previous state to invalidate it, it gets written to the log
> 4. The original batch is retried and will get an out-of-order sequence number
> 5. The current Java client will continue to retry this batch, but it will never 
> resolve. 
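For reference, a minimal sketch of the producer configuration described in the scenario above (effectively unbounded retries and delivery timeout, similar to how Kafka Streams configures its producers); the bootstrap address is a placeholder:

{code:java}
// Hedged sketch only -- the configuration that allows step 5 to retry indefinitely.
import org.apache.kafka.clients.producer.ProducerConfig;
import java.util.Properties;

public class IdempotentProducerConfigSketch {
    public static Properties config() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, Integer.MAX_VALUE);
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5); // up to 5 batches in flight
        return props;
    }
}
{code}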



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16360) Release plan of 3.x kafka releases.

2024-03-11 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825386#comment-17825386
 ] 

Justine Olshan commented on KAFKA-16360:


Hey there – there was some discussion on the mailing list; 3.8 should be the 
last 3.x release. See here: 
[https://lists.apache.org/thread/kvdp2gmq5gd9txkvxh5vk3z2n55b04s5] 
There is also a KIP, KIP-1012: 
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-1012%3A+The+need+for+a+Kafka+3.8.x+release]
Hope this clears things up.

> Release plan of 3.x kafka releases.
> ---
>
> Key: KAFKA-16360
> URL: https://issues.apache.org/jira/browse/KAFKA-16360
> Project: Kafka
>  Issue Type: Improvement
>Reporter: kaushik srinivas
>Priority: Major
>
> KIP 
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-833%3A+Mark+KRaft+as+Production+Ready#KIP833:MarkKRaftasProductionReady-ReleaseTimeline]
>  mentions ,
> h2. Kafka 3.7
>  * January 2024
>  * Final release with ZK mode
> But we see in Jira that some tickets are marked for the 3.8 release. Does Apache 
> plan to continue making 3.x releases, with both ZooKeeper and KRaft supported, 
> independent of the pure-KRaft 4.x releases?
> If yes, how many more releases can be expected on the 3.x release line?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16352) Transaction may get stuck in PrepareCommit or PrepareAbort state

2024-03-07 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824512#comment-17824512
 ] 

Justine Olshan commented on KAFKA-16352:


Thanks for filing [~alivshits]. If I understand correctly, the transaction is 
completed on the data partition side (in terms of writing the marker, LSO, etc.), 
but the coordinator is not able to complete the final book-keeping, so we are 
unable to continue using the transactional id.

> Transaction may get stuck in PrepareCommit or PrepareAbort state
> 
>
> Key: KAFKA-16352
> URL: https://issues.apache.org/jira/browse/KAFKA-16352
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Reporter: Artem Livshits
>Assignee: Artem Livshits
>Priority: Major
>
> A transaction took a long time to complete; trying to restart a producer 
> would lead to CONCURRENT_TRANSACTIONS errors.  Investigation has shown that 
> the transaction was stuck in PrepareCommit for a few days:
> (current time when the investigation happened: Feb 27 2024), transaction 
> state:
> {{Type   |Name                  |Value}}
> {{-}}
> {{ref    |transactionalId       |xxx-yyy}}
> {{long   |producerId            |299364}}
> {{ref    |state                 |kafka.coordinator.transaction.PrepareCommit$ 
> @ 0x44fe22760}}
> {{long   |txnStartTimestamp     |1708619624810  Thu Feb 22 2024 16:33:44.810 
> GMT+}}
> {{long   |txnLastUpdateTimestamp|1708619625335  Thu Feb 22 2024 16:33:45.335 
> GMT+}}
> {{-}}
> The partition list was empty and transactionsWithPendingMarkers didn't 
> contain the reference to the transactional state.  In the log there were the 
> following relevant messages:
> {{22 Feb 2024 @ 16:33:45.623 UTC [Transaction State Manager 1]: Completed 
> loading transaction metadata from __transaction_state-3 for coordinator epoch 
> 611}}
> (this is the partition that contains the transactional id).  After the data 
> is loaded, it sends out markers, etc.
> Then there is this message:
> {{22 Feb 2024 @ 16:33:45.696 UTC [Transaction Marker Request Completion 
> Handler 4]: Transaction coordinator epoch for xxx-yyy has changed from 610 to 
> 611; cancel sending transaction markers TxnMarkerEntry\{producerId=299364, 
> producerEpoch=1005, coordinatorEpoch=610, result=COMMIT, 
> partitions=[foo-bar]} to the brokers}}
> this message is logged just before the state is removed from 
> transactionsWithPendingMarkers, but the state apparently contained the entry 
> that was created by the load operation.  So the sequence of events probably 
> looked like the following:
>  # partition load completed
>  # commit markers were sent for transactional id xxx-yyy; entry in 
> transactionsWithPendingMarkers was created
>  # zombie reply from the previous epoch completed, removed entry from 
> transactionsWithPendingMarkers
>  # commit markers properly completed, but couldn't transition to 
> CommitComplete state because transactionsWithPendingMarkers didn't have the 
> proper entry, so it got stuck there until the broker was restarted
> Looking at the code, there are a few cases that could lead to similar race 
> conditions.  The fix is to keep track of the PendingCompleteTxn value that 
> was used when sending the marker, so that we only remove the state that 
> was created when the marker was sent and do not accidentally remove the state 
> someone else created.
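A small, hedged sketch of the idea behind that fix (this is not the coordinator code; class and field names are invented): remember exactly which pending-marker entry this send created, and remove only that entry, so a zombie completion from an older coordinator epoch cannot clear state created under the newer epoch.

{code:java}
// Hedged sketch only -- conditional removal keyed on the exact value we stored.
import java.util.concurrent.ConcurrentHashMap;

public class PendingMarkerCleanupSketch {
    record PendingCompleteTxn(long producerId, int coordinatorEpoch) {}

    static final ConcurrentHashMap<String, PendingCompleteTxn> pendingMarkers = new ConcurrentHashMap<>();

    static PendingCompleteTxn sendMarkers(String transactionalId, long producerId, int coordinatorEpoch) {
        PendingCompleteTxn pending = new PendingCompleteTxn(producerId, coordinatorEpoch);
        pendingMarkers.put(transactionalId, pending);
        // ... actually send the markers here ...
        return pending;
    }

    static void onMarkerCompletion(String transactionalId, PendingCompleteTxn whatThisSendCreated) {
        // Conditional removal: a stale callback holding an epoch-610 entry cannot
        // remove the entry that the epoch-611 load/send just created.
        pendingMarkers.remove(transactionalId, whatThisSendCreated);
    }
}
{code}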



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16308) Formatting and Updating Kafka Features

2024-02-26 Thread Justine Olshan (Jira)
Justine Olshan created KAFKA-16308:
--

 Summary: Formatting and Updating Kafka Features
 Key: KAFKA-16308
 URL: https://issues.apache.org/jira/browse/KAFKA-16308
 Project: Kafka
  Issue Type: Task
Reporter: Justine Olshan
Assignee: Justine Olshan


As part of KIP-1022, we need to extend the storage and upgrade tools to support 
features other than metadata version. 

See 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1023%3A+Formatting+and+Updating+Features



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16282) Allow to get last stable offset (LSO) in kafka-get-offsets.sh

2024-02-26 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820821#comment-17820821
 ] 

Justine Olshan commented on KAFKA-16282:


Thanks [~ahmedsobeh] 

A few things:
For the proposed Admin.java change, I suggest picking one of the options and 
listing the other as a rejected alternative. We may go with one or the other, 
but it is good to take a position for folks to agree or disagree with. 

For the compatibility section, we could have compatibility issues if someone 
uses this new field on a broker that doesn't support it. We should probably 
throw an error in that case. 

Generally though, I think the KIP is ready for discussion; I may have further 
comments for you on the mailing list :) 

> Allow to get last stable offset (LSO) in kafka-get-offsets.sh
> -
>
> Key: KAFKA-16282
> URL: https://issues.apache.org/jira/browse/KAFKA-16282
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Luke Chen
>Assignee: Ahmed Sobeh
>Priority: Major
>  Labels: need-kip, newbie, newbie++
>
> Currently, when using `kafka-get-offsets.sh` to get the offset by time, we 
> have these choices:
> {code:java}
> --time  /                   timestamp of the offsets before that.
>   -1 or latest /             [Note: No offset is returned, if the
>   -2 or earliest /           timestamp greater than recently
>   -3 or max-timestamp /      committed record timestamp is
>   -4 or earliest-local /     given.] (default: latest)
>   -5 or latest-tiered  
> {code}
> For the "latest" option, it'll always return the "high watermark" because we 
> always send with the default option: *IsolationLevel.READ_UNCOMMITTED*. It 
> would be good if the command can support to get the last stable offset (LSO) 
> for transaction support. That is, sending the option with 
> *IsolationLevel.READ_COMMITTED*
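For context, one way to observe the last stable offset today is a consumer configured with read_committed isolation, whose endOffsets call returns the LSO rather than the high watermark; the KIP would expose the same choice directly in kafka-get-offsets.sh. A hedged sketch (topic name and bootstrap address are placeholders):

{code:java}
// Hedged sketch only -- reading the LSO via a read_committed consumer.
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class LsoLookupSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed"); // LSO instead of HWM

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("quickstart-events4", 0); // placeholder topic
            Map<TopicPartition, Long> end = consumer.endOffsets(List.of(tp));
            System.out.println("LSO for " + tp + " = " + end.get(tp));
        }
    }
}
{code}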



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16302) Builds failing due to streams test execution failures

2024-02-22 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan resolved KAFKA-16302.

Resolution: Fixed

> Builds failing due to streams test execution failures
> -
>
> Key: KAFKA-16302
> URL: https://issues.apache.org/jira/browse/KAFKA-16302
> Project: Kafka
>  Issue Type: Task
>  Components: streams, unit tests
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
>
> I'm seeing this on master and many PR builds for all versions:
>  
> {code:java}
> [2024-02-22T14:37:07.076Z] * What went wrong:
> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1426[2024-02-22T14:37:07.076Z]
>  Execution failed for task ':streams:test'.
> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1427[2024-02-22T14:37:07.076Z]
>  > The following test methods could not be retried, which is unexpected. 
> Please file a bug report at 
> https://github.com/gradle/test-retry-gradle-plugin/issues
> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1428[2024-02-22T14:37:07.076Z]
>  
> org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@78d39a69]
> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1429[2024-02-22T14:37:07.076Z]
>  
> org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@3c818ac4]
> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1430[2024-02-22T14:37:07.076Z]
>  
> org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@251f7d26]
> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1431[2024-02-22T14:37:07.076Z]
>  
> org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@52c8295b]
> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1432[2024-02-22T14:37:07.076Z]
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16302) Builds failing due to streams test execution failures

2024-02-22 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819835#comment-17819835
 ] 

Justine Olshan commented on KAFKA-16302:


Root cause is [https://github.com/apache/kafka/pull/15396] removing the log 
line needed for these tests to pass


{code:java}
java.lang.AssertionError: Expected: a collection containing "Skipping record 
for expired segment." but: was empty{code}

> Builds failing due to streams test execution failures
> -
>
> Key: KAFKA-16302
> URL: https://issues.apache.org/jira/browse/KAFKA-16302
> Project: Kafka
>  Issue Type: Task
>Reporter: Justine Olshan
>Priority: Major
>
> I'm seeing this on master and many PR builds for all versions:
>  
> {code:java}
> [2024-02-22T14:37:07.076Z] * What went wrong:
> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1426[2024-02-22T14:37:07.076Z]
>  Execution failed for task ':streams:test'.
> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1427[2024-02-22T14:37:07.076Z]
>  > The following test methods could not be retried, which is unexpected. 
> Please file a bug report at 
> https://github.com/gradle/test-retry-gradle-plugin/issues
> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1428[2024-02-22T14:37:07.076Z]
>  
> org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@78d39a69]
> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1429[2024-02-22T14:37:07.076Z]
>  
> org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@3c818ac4]
> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1430[2024-02-22T14:37:07.076Z]
>  
> org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@251f7d26]
> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1431[2024-02-22T14:37:07.076Z]
>  
> org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@52c8295b]
> https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1432[2024-02-22T14:37:07.076Z]
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16302) Builds failing due to streams test execution failures

2024-02-22 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan updated KAFKA-16302:
---
Description: 
I'm seeing this on master and many PR builds for all versions:

 
{code:java}
[2024-02-22T14:37:07.076Z] * What went wrong:
https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1426[2024-02-22T14:37:07.076Z]
 Execution failed for task ':streams:test'.
https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1427[2024-02-22T14:37:07.076Z]
 > The following test methods could not be retried, which is unexpected. Please 
file a bug report at https://github.com/gradle/test-retry-gradle-plugin/issues
https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1428[2024-02-22T14:37:07.076Z]
 
org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@78d39a69]
https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1429[2024-02-22T14:37:07.076Z]
 
org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@3c818ac4]
https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1430[2024-02-22T14:37:07.076Z]
 
org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@251f7d26]
https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1431[2024-02-22T14:37:07.076Z]
 
org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@52c8295b]
https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1432[2024-02-22T14:37:07.076Z]
{code}

 

  was:
I'm seeing this on master and many PR builds for all versions:

```
[2024-02-22T14:37:07.076Z] * What went wrong: 

[|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1426][2024-02-22T14:37:07.076Z]
 Execution failed for task ':streams:test'. 

[|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1427][2024-02-22T14:37:07.076Z]
 > The following test methods could not be retried, which is unexpected. Please 
file a bug report at 
[https://github.com/gradle/test-retry-gradle-plugin/issues] 

[|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1428][2024-02-22T14:37:07.076Z]
 
org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@78d39a69]
 

[|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1429][2024-02-22T14:37:07.076Z]
 
org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@3c818ac4]
 

[|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1430][2024-02-22T14:37:07.076Z]
 
org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@251f7d26]
 

[|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1431][2024-02-22T14:37:07.076Z]
 
org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@52c8295b]
```
 


[|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1432][2024-02-22T14:37:07.076Z]
 
 


> Builds failing due to streams test execution failures
> -
>
> Key: KAFKA-16302
> URL: https://issues.apache.org/jira/browse/KAFKA-16302
> Project: Kafka
>  Issue Type: Task
>Reporter: Justine Olshan
>Priority: Major
>
> I'm seeing this on master and many PR builds for all versions:
>  
> {code:java}
> [2024-02-22T14:37:07.076Z] * What went wrong:
> 

[jira] [Created] (KAFKA-16302) Builds failing due to streams test execution failures

2024-02-22 Thread Justine Olshan (Jira)
Justine Olshan created KAFKA-16302:
--

 Summary: Builds failing due to streams test execution failures
 Key: KAFKA-16302
 URL: https://issues.apache.org/jira/browse/KAFKA-16302
 Project: Kafka
  Issue Type: Task
Reporter: Justine Olshan


I'm seeing this on master and many PR builds for all versions:

```
[2024-02-22T14:37:07.076Z] * What went wrong: 

[|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1426][2024-02-22T14:37:07.076Z]
 Execution failed for task ':streams:test'. 

[|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1427][2024-02-22T14:37:07.076Z]
 > The following test methods could not be retried, which is unexpected. Please 
file a bug report at 
[https://github.com/gradle/test-retry-gradle-plugin/issues] 

[|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1428][2024-02-22T14:37:07.076Z]
 
org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@78d39a69]
 

[|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1429][2024-02-22T14:37:07.076Z]
 
org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@3c818ac4]
 

[|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1430][2024-02-22T14:37:07.076Z]
 
org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@251f7d26]
 

[|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1431][2024-02-22T14:37:07.076Z]
 
org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@52c8295b]
```
 


[|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1432][2024-02-22T14:37:07.076Z]
 
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15665) Enforce ISR to have all target replicas when complete partition reassignment

2024-02-21 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan resolved KAFKA-15665.

Resolution: Fixed

> Enforce ISR to have all target replicas when complete partition reassignment
> 
>
> Key: KAFKA-15665
> URL: https://issues.apache.org/jira/browse/KAFKA-15665
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Calvin Liu
>Assignee: Calvin Liu
>Priority: Major
>
> Current partition reassignment can be completed when the new ISR is under min 
> ISR. We should fix this behavior.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-15898) Flaky test: testFenceMultipleBrokers() – org.apache.kafka.controller.QuorumControllerTest

2024-02-21 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819351#comment-17819351
 ] 

Justine Olshan edited comment on KAFKA-15898 at 2/21/24 6:26 PM:
-

I have 4 failures on one run 
[https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15384/8/tests]


was (Author: jolshan):
I have 8 failures on one run 
https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15384/8/tests

> Flaky test: testFenceMultipleBrokers() – 
> org.apache.kafka.controller.QuorumControllerTest
> -
>
> Key: KAFKA-15898
> URL: https://issues.apache.org/jira/browse/KAFKA-15898
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apoorv Mittal
>Priority: Major
>  Labels: flaky-test
>
> Build run: 
> [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14699/21/tests/]
>  
> {code:java}
> java.util.concurrent.TimeoutException: testFenceMultipleBrokers() timed out 
> after 40 secondsStacktracejava.util.concurrent.TimeoutException: 
> testFenceMultipleBrokers() timed out after 40 secondsat 
> org.junit.jupiter.engine.extension.TimeoutExceptionFactory.create(TimeoutExceptionFactory.java:29)
>at 
> org.junit.jupiter.engine.extension.SameThreadTimeoutInvocation.proceed(SameThreadTimeoutInvocation.java:58)
>   at 
> org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:156)
>  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:147)
>at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:86)
> at 
> org.junit.jupiter.engine.execution.InterceptingExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(InterceptingExecutableInvoker.java:103)
>  at 
> org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.lambda$invoke$0(InterceptingExecutableInvoker.java:93)
>   at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
>  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
> at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
>  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
>  at 
> org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:92)
>at 
> org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:86)
>at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$7(TestMethodTestDescriptor.java:218)
> at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
> at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:214)
>  at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:139)
>   at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:69)
>at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:151)
>at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
> at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
>at 
> org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
> at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
>at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
> at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
> at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
> at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)   at 
> org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:41)
> at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:155)
>at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
> at 
> 

[jira] [Commented] (KAFKA-15898) Flaky test: testFenceMultipleBrokers() – org.apache.kafka.controller.QuorumControllerTest

2024-02-21 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819351#comment-17819351
 ] 

Justine Olshan commented on KAFKA-15898:


I have 8 failures on one run 
https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15384/8/tests

> Flaky test: testFenceMultipleBrokers() – 
> org.apache.kafka.controller.QuorumControllerTest
> -
>
> Key: KAFKA-15898
> URL: https://issues.apache.org/jira/browse/KAFKA-15898
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apoorv Mittal
>Priority: Major
>  Labels: flaky-test
>
> Build run: 
> [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14699/21/tests/]
>  
> {code:java}
> java.util.concurrent.TimeoutException: testFenceMultipleBrokers() timed out 
> after 40 secondsStacktracejava.util.concurrent.TimeoutException: 
> testFenceMultipleBrokers() timed out after 40 secondsat 
> org.junit.jupiter.engine.extension.TimeoutExceptionFactory.create(TimeoutExceptionFactory.java:29)
>at 
> org.junit.jupiter.engine.extension.SameThreadTimeoutInvocation.proceed(SameThreadTimeoutInvocation.java:58)
>   at 
> org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:156)
>  at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:147)
>at 
> org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:86)
> at 
> org.junit.jupiter.engine.execution.InterceptingExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(InterceptingExecutableInvoker.java:103)
>  at 
> org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.lambda$invoke$0(InterceptingExecutableInvoker.java:93)
>   at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
>  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
> at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
>  at 
> org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
>  at 
> org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:92)
>at 
> org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:86)
>at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$7(TestMethodTestDescriptor.java:218)
> at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
> at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:214)
>  at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:139)
>   at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:69)
>at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:151)
>at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
> at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
>at 
> org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
> at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
>at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
> at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
> at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
> at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)   at 
> org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:41)
> at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:155)
>at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
> at 
> org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
>at 
> org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)

[jira] [Commented] (KAFKA-16283) RoundRobinPartitioner will only send to half of the partitions in a topic

2024-02-21 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819325#comment-17819325
 ] 

Justine Olshan commented on KAFKA-16283:


also https://issues.apache.org/jira/browse/KAFKA-14573 

> RoundRobinPartitioner will only send to half of the partitions in a topic
> -
>
> Key: KAFKA-16283
> URL: https://issues.apache.org/jira/browse/KAFKA-16283
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.0.0, 3.6.1
>Reporter: Luke Chen
>Priority: Major
>
> When using `org.apache.kafka.clients.producer.RoundRobinPartitioner`, we 
> expect data to be sent to all partitions in a round-robin manner. But we found 
> that only half of the partitions got the data. This causes half of the 
> resources (storage, consumers, ...) to be wasted.
> {code:java}
> > bin/kafka-topics.sh --create --topic quickstart-events4 --bootstrap-server 
> > localhost:9092 --partitions 2 
> Created topic quickstart-events4.
> # send 1000 records to the topic, expecting 500 records in partition0, and 
> 500 records in partition1
> > bin/kafka-producer-perf-test.sh --topic quickstart-events4 --num-records 
> > 1000 --record-size 1024 --throughput -1 --producer-props 
> > bootstrap.servers=localhost:9092 
> > partitioner.class=org.apache.kafka.clients.producer.RoundRobinPartitioner
> 1000 records sent, 6535.947712 records/sec (6.38 MB/sec), 2.88 ms avg 
> latency, 121.00 ms max latency, 2 ms 50th, 7 ms 95th, 10 ms 99th, 121 ms 
> 99.9th.
> > ls -al /tmp/kafka-logs/quickstart-events4-1
> total 24
> drwxr-xr-x   7 lukchen  wheel   224  2 20 19:53 .
> drwxr-xr-x  70 lukchen  wheel  2240  2 20 19:53 ..
> -rw-r--r--   1 lukchen  wheel  10485760  2 20 19:53 00000000000000000000.index
> -rw-r--r--   1 lukchen  wheel   1037819  2 20 19:53 00000000000000000000.log
> -rw-r--r--   1 lukchen  wheel  10485756  2 20 19:53 00000000000000000000.timeindex
> -rw-r--r--   1 lukchen  wheel         8  2 20 19:53 leader-epoch-checkpoint
> -rw-r--r--   1 lukchen  wheel        43  2 20 19:53 partition.metadata
> # No records in partition 1
> > ls -al /tmp/kafka-logs/quickstart-events4-0
> total 8
> drwxr-xr-x   7 lukchen  wheel   224  2 20 19:53 .
> drwxr-xr-x  70 lukchen  wheel  2240  2 20 19:53 ..
> -rw-r--r--   1 lukchen  wheel  10485760  2 20 19:53 00000000000000000000.index
> -rw-r--r--   1 lukchen  wheel         0  2 20 19:53 00000000000000000000.log
> -rw-r--r--   1 lukchen  wheel  10485756  2 20 19:53 00000000000000000000.timeindex
> -rw-r--r--   1 lukchen  wheel         0  2 20 19:53 leader-epoch-checkpoint
> -rw-r--r--   1 lukchen  wheel        43  2 20 19:53 partition.metadata
> {code}
> Tested in Kafka 3.0.0, 3.2.3, and the latest trunk; they all have the same 
> issue. It has likely existed for a long time.
>  
> Had a quick look; it's because we abort on a new batch (abortOnNewBatch) each 
> time a new batch is created.
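For illustration, here is a minimal, self-contained sketch (not the actual 
RoundRobinPartitioner or KafkaProducer code) of why a counter-based round-robin 
partitioner ends up using only half of the partitions when the producer aborts 
the append on a new batch and calls the partitioner a second time for the same 
record:

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinSkipSketch {
    public static void main(String[] args) {
        // Per-topic counter, as a round-robin partitioner would keep.
        AtomicInteger counter = new AtomicInteger(0);
        int numPartitions = 2;

        for (int record = 0; record < 4; record++) {
            int first = counter.getAndIncrement() % numPartitions;   // partition() call #1
            // abortOnNewBatch: a fresh batch is created, so partition() is called again
            int actual = counter.getAndIncrement() % numPartitions;  // partition() call #2 wins
            System.out.println("record " + record + ": first=" + first + ", actual=" + actual);
        }
        // With 2 partitions every record lands on partition 1, and partition 0 stays
        // empty, matching the empty log for quickstart-events4-0 shown above.
    }
}
{code}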



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16283) RoundRobinPartitioner will only send to half of the partitions in a topic

2024-02-21 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819324#comment-17819324
 ] 

Justine Olshan commented on KAFKA-16283:


Is https://issues.apache.org/jira/browse/KAFKA-13359 another dupe?

> RoundRobinPartitioner will only send to half of the partitions in a topic
> -
>
> Key: KAFKA-16283
> URL: https://issues.apache.org/jira/browse/KAFKA-16283
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.0.0, 3.6.1
>Reporter: Luke Chen
>Priority: Major
>
> When using `org.apache.kafka.clients.producer.RoundRobinPartitioner`, we 
> expect data to be sent to all partitions in a round-robin manner. But we found 
> that only half of the partitions got the data. This causes half of the 
> resources (storage, consumers, ...) to be wasted.
> {code:java}
> > bin/kafka-topics.sh --create --topic quickstart-events4 --bootstrap-server 
> > localhost:9092 --partitions 2 
> Created topic quickstart-events4.
> # send 1000 records to the topic, expecting 500 records in partition0, and 
> 500 records in partition1
> > bin/kafka-producer-perf-test.sh --topic quickstart-events4 --num-records 
> > 1000 --record-size 1024 --throughput -1 --producer-props 
> > bootstrap.servers=localhost:9092 
> > partitioner.class=org.apache.kafka.clients.producer.RoundRobinPartitioner
> 1000 records sent, 6535.947712 records/sec (6.38 MB/sec), 2.88 ms avg 
> latency, 121.00 ms max latency, 2 ms 50th, 7 ms 95th, 10 ms 99th, 121 ms 
> 99.9th.
> > ls -al /tmp/kafka-logs/quickstart-events4-1
> total 24
> drwxr-xr-x   7 lukchen  wheel   224  2 20 19:53 .
> drwxr-xr-x  70 lukchen  wheel  2240  2 20 19:53 ..
> -rw-r--r--   1 lukchen  wheel  10485760  2 20 19:53 00000000000000000000.index
> -rw-r--r--   1 lukchen  wheel   1037819  2 20 19:53 00000000000000000000.log
> -rw-r--r--   1 lukchen  wheel  10485756  2 20 19:53 00000000000000000000.timeindex
> -rw-r--r--   1 lukchen  wheel         8  2 20 19:53 leader-epoch-checkpoint
> -rw-r--r--   1 lukchen  wheel        43  2 20 19:53 partition.metadata
> # No records in partition 1
> > ls -al /tmp/kafka-logs/quickstart-events4-0
> total 8
> drwxr-xr-x   7 lukchen  wheel   224  2 20 19:53 .
> drwxr-xr-x  70 lukchen  wheel  2240  2 20 19:53 ..
> -rw-r--r--   1 lukchen  wheel  10485760  2 20 19:53 00000000000000000000.index
> -rw-r--r--   1 lukchen  wheel         0  2 20 19:53 00000000000000000000.log
> -rw-r--r--   1 lukchen  wheel  10485756  2 20 19:53 00000000000000000000.timeindex
> -rw-r--r--   1 lukchen  wheel         0  2 20 19:53 leader-epoch-checkpoint
> -rw-r--r--   1 lukchen  wheel        43  2 20 19:53 partition.metadata
> {code}
> Tested in Kafka 3.0.0, 3.2.3, and the latest trunk; they all have the same 
> issue. It has likely existed for a long time.
>  
> Had a quick look; it's because we abort on a new batch (abortOnNewBatch) each 
> time a new batch is created.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16212) Cache partitions by TopicIdPartition instead of TopicPartition

2024-02-20 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819023#comment-17819023
 ] 

Justine Olshan commented on KAFKA-16212:


We use something similar in the fetch session cache when the topic ID is 
unknown. 
This transition state will last as long as we support ZK, I would suspect. 

 

> how would this look like that I need to be aware of when extending the 
> ReplicaManager cache to be topicId aware
Not sure I understand this question.

> Cache partitions by TopicIdPartition instead of TopicPartition
> --
>
> Key: KAFKA-16212
> URL: https://issues.apache.org/jira/browse/KAFKA-16212
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 3.7.0
>Reporter: Gaurav Narula
>Assignee: Omnia Ibrahim
>Priority: Major
>
> From the discussion in [PR 
> 15263|https://github.com/apache/kafka/pull/15263#discussion_r1471075201], it 
> would be better to cache {{allPartitions}} by {{TopicIdPartition}} instead of 
> {{TopicPartition}} to avoid ambiguity.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16282) Allow to get last stable offset (LSO) in kafka-get-offsets.sh

2024-02-20 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818903#comment-17818903
 ] 

Justine Olshan commented on KAFKA-16282:


I am also happy to review the KIP and PRs :) 

> Allow to get last stable offset (LSO) in kafka-get-offsets.sh
> -
>
> Key: KAFKA-16282
> URL: https://issues.apache.org/jira/browse/KAFKA-16282
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Luke Chen
>Assignee: Ahmed Sobeh
>Priority: Major
>  Labels: need-kip, newbie, newbie++
>
> Currently, when using `kafka-get-offsets.sh` to get the offset by time, we 
> have these choices:
> {code:java}
> --time  /                timestamp of the offsets before that.
>   -1 or latest /           [Note: No offset is returned, if the
>   -2 or earliest /         timestamp greater than recently
>   -3 or max-timestamp /    committed record timestamp is
>   -4 or earliest-local /   given.] (default: latest)
>   -5 or latest-tiered  
> {code}
> For the "latest" option, it'll always return the "high watermark" because we 
> always send with the default option: *IsolationLevel.READ_UNCOMMITTED*. It 
> would be good if the command can support to get the last stable offset (LSO) 
> for transaction support. That is, sending the option with 
> *IsolationLevel.READ_COMMITTED*
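For context, a minimal sketch (using the public Admin API rather than the 
kafka-get-offsets.sh code itself) of how the LSO can be fetched by passing 
IsolationLevel.READ_COMMITTED to a ListOffsets request; the topic name and 
bootstrap address are placeholders:

{code:java}
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsOptions;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.common.IsolationLevel;
import org.apache.kafka.common.TopicPartition;

public class LsoSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            TopicPartition tp = new TopicPartition("quickstart-events4", 0);
            // READ_COMMITTED returns the last stable offset instead of the high watermark.
            ListOffsetsResult result = admin.listOffsets(
                    Map.of(tp, OffsetSpec.latest()),
                    new ListOffsetsOptions(IsolationLevel.READ_COMMITTED));
            System.out.println("LSO for " + tp + ": " + result.partitionResult(tp).get().offset());
        }
    }
}
{code}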



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16264) Expose `producer.id.expiration.check.interval.ms` as dynamic broker configuration

2024-02-16 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818064#comment-17818064
 ] 

Justine Olshan commented on KAFKA-16264:


Hey [~jeqo] thanks for filing. I was also thinking about this after you filed 
the first ticket.


One thing that is interesting is that right now the expiration is scheduled on a 
separate thread at startup. I guess the best course of action is to cancel that 
scheduled task and create a new one?
See UnifiedLog producerExpireCheck.
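A minimal sketch of that cancel-and-reschedule idea, using a plain 
ScheduledExecutorService for illustration rather than the broker's own scheduler; 
removeExpiredProducers and the field names here are assumptions, not the actual 
UnifiedLog code:

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class ExpireCheckSketch {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> producerExpireCheck;

    ExpireCheckSketch(long intervalMs) {
        producerExpireCheck = schedule(intervalMs);
    }

    // Called when producer.id.expiration.check.interval.ms is updated dynamically.
    synchronized void updateCheckInterval(long newIntervalMs) {
        producerExpireCheck.cancel(false); // let an in-flight run finish
        producerExpireCheck = schedule(newIntervalMs);
    }

    private ScheduledFuture<?> schedule(long intervalMs) {
        return scheduler.scheduleAtFixedRate(this::removeExpiredProducers,
                intervalMs, intervalMs, TimeUnit.MILLISECONDS);
    }

    private void removeExpiredProducers() {
        // Placeholder for the actual producer id expiration logic.
    }
}
{code}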

> Expose `producer.id.expiration.check.interval.ms` as dynamic broker 
> configuration
> -
>
> Key: KAFKA-16264
> URL: https://issues.apache.org/jira/browse/KAFKA-16264
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jorge Esteban Quilcate Otoya
>Assignee: Jorge Esteban Quilcate Otoya
>Priority: Major
>
> Dealing with a scenario where too many producer ids lead to issues (e.g. high 
> CPU utilization, see KAFKA-16229) puts operators in need of flushing producer ids 
> more promptly than usual.
> Currently, only the expiration timeout `producer.id.expiration.ms` is exposed 
> as a dynamic config. This is helpful (e.g. by reducing the timeout, fewer 
> producers are eventually kept in memory), but not enough if the 
> evaluation frequency is not short enough to flush producer ids before they 
> become an issue. Only by tuning both can the issue be worked around.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16212) Cache partitions by TopicIdPartition instead of TopicPartition

2024-02-15 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817795#comment-17817795
 ] 

Justine Olshan commented on KAFKA-16212:


Thanks [~omnia_h_ibrahim] I remember this particularly being a tricky area for 
topic IDs, so I appreciate the time being spent here.

 

For fetch requests, I remember we had to do something similar to proposal 1 for 
the fetch path. When we receive a fetch request with version < 13, we will have a 
zero topic ID stored. (Note: this is a placeholder and cannot be used as a valid 
topic ID.)


Does using the zero uuid make some of option 1 easier? Or are we already using 
the zero uuid to signify something?
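A hedged sketch of what such a lookup could look like with the zero-UUID 
placeholder; the map names (partitionsByName, partitionsById) are illustrative, 
not the actual ReplicaManager fields:

{code:java}
import java.util.Map;
import org.apache.kafka.common.TopicIdPartition;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.Uuid;

// P stands in for whatever the cache stores per partition.
class PartitionLookupSketch<P> {
    Map<TopicPartition, P> partitionsByName;
    Map<TopicIdPartition, P> partitionsById;

    P lookup(Uuid topicId, String topic, int partitionIndex) {
        if (topicId == null || Uuid.ZERO_UUID.equals(topicId)) {
            // Older fetch versions (< 13) carry no topic ID; fall back to name-based lookup.
            return partitionsByName.get(new TopicPartition(topic, partitionIndex));
        }
        return partitionsById.get(new TopicIdPartition(topicId, partitionIndex, topic));
    }
}
{code}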

> Cache partitions by TopicIdPartition instead of TopicPartition
> --
>
> Key: KAFKA-16212
> URL: https://issues.apache.org/jira/browse/KAFKA-16212
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 3.7.0
>Reporter: Gaurav Narula
>Assignee: Omnia Ibrahim
>Priority: Major
>
> From the discussion in [PR 
> 15263|https://github.com/apache/kafka/pull/15263#discussion_r1471075201], it 
> would be better to cache {{allPartitions}} by {{TopicIdPartition}} instead of 
> {{TopicPartition}} to avoid ambiguity.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16245) DescribeConsumerGroupTest failing

2024-02-12 Thread Justine Olshan (Jira)
Justine Olshan created KAFKA-16245:
--

 Summary: DescribeConsumerGroupTest failing
 Key: KAFKA-16245
 URL: https://issues.apache.org/jira/browse/KAFKA-16245
 Project: Kafka
  Issue Type: Task
Reporter: Justine Olshan


The first instances on trunk are in this PR 
[https://github.com/apache/kafka/pull/15275]
And this PR seems to have the test failing consistently in its builds, when it 
wasn't failing this consistently before.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16229) Slow expiration of Producer IDs leading to high CPU usage

2024-02-12 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan resolved KAFKA-16229.

Resolution: Fixed

> Slow expiration of Producer IDs leading to high CPU usage
> -
>
> Key: KAFKA-16229
> URL: https://issues.apache.org/jira/browse/KAFKA-16229
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jorge Esteban Quilcate Otoya
>Assignee: Jorge Esteban Quilcate Otoya
>Priority: Major
>
> Expiration of ProducerIds is implemented with a slow removal of map keys:
> ```
>         producers.keySet().removeAll(keys);
> ```
> This unnecessarily goes through all producer ids and only then throws out the 
> expired keys.
> It leads to roughly quadratic time in the worst case, when most/all keys need 
> to be removed:
> ```
> Benchmark                                        (numProducerIds)  Mode  Cnt           Score            Error  Units
> ProducerStateManagerBench.testDeleteExpiringIds               100  avgt    3        9164.043 ±      10647.877  ns/op
> ProducerStateManagerBench.testDeleteExpiringIds              1000  avgt    3      341561.093 ±      20283.211  ns/op
> ProducerStateManagerBench.testDeleteExpiringIds             10000  avgt    3    44957983.550 ±    9389011.290  ns/op
> ProducerStateManagerBench.testDeleteExpiringIds            100000  avgt    3  5683374164.167 ± 1446242131.466  ns/op
> ```
> A simple fix is to use map#remove(key) instead, leading to more linear growth:
> ```
> Benchmark                                        (numProducerIds)  Mode  Cnt        Score         Error  Units
> ProducerStateManagerBench.testDeleteExpiringIds               100  avgt    3     5779.056 ±     651.389  ns/op
> ProducerStateManagerBench.testDeleteExpiringIds              1000  avgt    3    61430.530 ±   21875.644  ns/op
> ProducerStateManagerBench.testDeleteExpiringIds             10000  avgt    3   643887.031 ±  600475.302  ns/op
> ProducerStateManagerBench.testDeleteExpiringIds            100000  avgt    3  7741689.539 ± 3218317.079  ns/op
> ```
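A small, self-contained sketch of the two removal strategies, with a plain 
HashMap standing in for the producer-state map (illustrative only, not the 
ProducerStateManager code):

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExpireRemovalSketch {
    public static void main(String[] args) {
        Map<Long, String> producers = new HashMap<>();
        for (long pid = 0; pid < 100_000; pid++) {
            producers.put(pid, "state-" + pid);
        }
        List<Long> expiredKeys = new ArrayList<>(producers.keySet());

        // Slow path: when most keys expire, keySet().removeAll(list) iterates the key set
        // and calls List.contains for every key, which is O(n * m).
        // producers.keySet().removeAll(expiredKeys);

        // Faster path: one hash lookup per expired key, roughly linear overall.
        for (Long pid : expiredKeys) {
            producers.remove(pid);
        }
        System.out.println("remaining producers: " + producers.size());
    }
}
{code}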



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16225) Flaky test suite LogDirFailureTest

2024-02-09 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816260#comment-17816260
 ] 

Justine Olshan commented on KAFKA-16225:


Looks like it happened on 3.7 the same day which narrows it down to 3 commits?
https://ge.apache.org/scans/tests?search.names=Git%20branch=P28D=kafka=America%2FLos_Angeles=3.7=kafka.server.LogDirFailureTest

> Flaky test suite LogDirFailureTest
> --
>
> Key: KAFKA-16225
> URL: https://issues.apache.org/jira/browse/KAFKA-16225
> Project: Kafka
>  Issue Type: Bug
>  Components: core, unit tests
>Reporter: Greg Harris
>Priority: Major
>  Labels: flaky-test
>
> I see this failure on trunk and in PR builds for multiple methods in this 
> test suite:
> {noformat}
> org.opentest4j.AssertionFailedError: expected:  but was:     
> at 
> org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>     
> at 
> org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>     
> at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)    
> at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)    
> at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)    
> at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:179)    
> at kafka.utils.TestUtils$.causeLogDirFailure(TestUtils.scala:1715)    
> at 
> kafka.server.LogDirFailureTest.testProduceAfterLogDirFailureOnLeader(LogDirFailureTest.scala:186)
>     
> at 
> kafka.server.LogDirFailureTest.testIOExceptionDuringLogRoll(LogDirFailureTest.scala:70){noformat}
> It appears this assertion is failing
> [https://github.com/apache/kafka/blob/f54975c33135140351c50370282e86c49c81bbdd/core/src/test/scala/unit/kafka/utils/TestUtils.scala#L1715]
> The other error which is appearing is this:
> {noformat}
> org.opentest4j.AssertionFailedError: Unexpected exception type thrown, 
> expected:  but was: 
>     
> at 
> org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>     
> at org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:67)    
> at org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:35)    
> at org.junit.jupiter.api.Assertions.assertThrows(Assertions.java:3111)    
> at 
> kafka.server.LogDirFailureTest.testProduceErrorsFromLogDirFailureOnLeader(LogDirFailureTest.scala:164)
>     
> at 
> kafka.server.LogDirFailureTest.testProduceErrorFromFailureOnLogRoll(LogDirFailureTest.scala:64){noformat}
> Failures appear to have started in this commit, but this does not indicate 
> that this commit is at fault: 
> [https://github.com/apache/kafka/tree/3d95a69a28c2d16e96618cfa9a1eb69180fb66ea]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16225) Flaky test suite LogDirFailureTest

2024-02-09 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816258#comment-17816258
 ] 

Justine Olshan commented on KAFKA-16225:


I'm encountering this too. It looks pretty bad

https://ge.apache.org/scans/tests?search.names=Git%20branch=P28D=kafka=America%2FLos_Angeles=trunk=kafka.server.LogDirFailureTest

> Flaky test suite LogDirFailureTest
> --
>
> Key: KAFKA-16225
> URL: https://issues.apache.org/jira/browse/KAFKA-16225
> Project: Kafka
>  Issue Type: Bug
>  Components: core, unit tests
>Reporter: Greg Harris
>Priority: Major
>  Labels: flaky-test
>
> I see this failure on trunk and in PR builds for multiple methods in this 
> test suite:
> {noformat}
> org.opentest4j.AssertionFailedError: expected:  but was:     
> at 
> org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>     
> at 
> org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>     
> at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)    
> at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)    
> at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)    
> at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:179)    
> at kafka.utils.TestUtils$.causeLogDirFailure(TestUtils.scala:1715)    
> at 
> kafka.server.LogDirFailureTest.testProduceAfterLogDirFailureOnLeader(LogDirFailureTest.scala:186)
>     
> at 
> kafka.server.LogDirFailureTest.testIOExceptionDuringLogRoll(LogDirFailureTest.scala:70){noformat}
> It appears this assertion is failing
> [https://github.com/apache/kafka/blob/f54975c33135140351c50370282e86c49c81bbdd/core/src/test/scala/unit/kafka/utils/TestUtils.scala#L1715]
> The other error which is appearing is this:
> {noformat}
> org.opentest4j.AssertionFailedError: Unexpected exception type thrown, 
> expected:  but was: 
>     
> at 
> org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>     
> at org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:67)    
> at org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:35)    
> at org.junit.jupiter.api.Assertions.assertThrows(Assertions.java:3111)    
> at 
> kafka.server.LogDirFailureTest.testProduceErrorsFromLogDirFailureOnLeader(LogDirFailureTest.scala:164)
>     
> at 
> kafka.server.LogDirFailureTest.testProduceErrorFromFailureOnLogRoll(LogDirFailureTest.scala:64){noformat}
> Failures appear to have started in this commit, but this does not indicate 
> that this commit is at fault: 
> [https://github.com/apache/kafka/tree/3d95a69a28c2d16e96618cfa9a1eb69180fb66ea]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14920) Address timeouts and out of order sequences

2024-02-07 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan updated KAFKA-14920:
---
Description: 
KAFKA-14844 showed the destructive nature of a timeout on the first produce 
request for a topic partition (i.e. one that has no state in the producer state 
manager).

Since we currently don't validate the first sequence (we will in part 2 of 
KIP-890), any transient error on the first produce can lead to out of order 
sequences that never recover.

Originally, KAFKA-14561 relied on the producer's retry mechanism for these 
transient issues, but until that is fixed, we may need to retry from the 
AddPartitionsManager instead. We addressed the concurrent-transactions case, but 
there are other errors, like coordinator loading, that we could run into and see 
increased out of order issues.



Since we currently have not validated the first sequence (we will in part 2 of 
KIP-890), any transient error on the first produce can lead to out-of-order 
sequences that never recover.

Originally, KAFKA-14561 relied on the producer's retry mechanism for these 
transient issues, but until that is fixed, we may need to retry from the 
AddPartitionsManager instead. We addressed concurrent transactions, but there are 
other errors, such as coordinator loading, that we could run into and see more 
out-of-order issues.

  was:
KAFKA-14844 showed the destructive nature of a timeout on the first produce 
request for a topic partition (i.e. one that has no state in the producer state 
manager).

Since we currently have not validated the first sequence (we will in part 2 of 
KIP-890), any transient error on the first produce can lead to out-of-order 
sequences that never recover.

Originally, KAFKA-14561 relied on the producer's retry mechanism for these 
transient issues, but until that is fixed, we may need to retry from the 
AddPartitionsManager instead. We addressed concurrent transactions, but there are 
other errors, such as coordinator loading, that we could run into and see more 
out-of-order issues.


> Address timeouts and out of order sequences
> ---
>
> Key: KAFKA-14920
> URL: https://issues.apache.org/jira/browse/KAFKA-14920
> Project: Kafka
>  Issue Type: Sub-task
>Affects Versions: 3.6.0
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Blocker
> Fix For: 3.6.0
>
>
> KAFKA-14844 showed the destructive nature of a timeout on the first produce 
> request for a topic partition (i.e. one that has no state in the producer state 
> manager).
> Since we currently don't validate the first sequence (we will in part 2 of 
> KIP-890), any transient error on the first produce can lead to out of order 
> sequences that never recover.
> Originally, KAFKA-14561 relied on the producer's retry mechanism for these 
> transient issues, but until that is fixed, we may need to retry from the 
> AddPartitionsManager instead. We addressed the concurrent-transactions case, but 
> there are other errors, like coordinator loading, that we could run into and 
> see increased out of order issues.
> Since we currently have not validated the first sequence (we will in part 2 of 
> KIP-890), any transient error on the first produce can lead to out-of-order 
> sequences that never recover.
> Originally, KAFKA-14561 relied on the producer's retry mechanism for these 
> transient issues, but until that is fixed, we may need to retry from the 
> AddPartitionsManager instead. We addressed concurrent transactions, but there 
> are other errors, such as coordinator loading, that we could run into and see 
> more out-of-order issues.
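A hedged sketch of the retry idea above: the error constants are real Errors 
values, but the callback structure and the scheduleVerificationRetry helper are 
assumptions for illustration, not the actual AddPartitionsToTxnManager code:

{code:java}
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.protocol.Errors;

// Illustrative verification callback: retry transient errors instead of failing the produce.
abstract class VerificationCallbackSketch {
    void onVerificationResponse(TopicPartition tp, Errors error) {
        switch (error) {
            case NONE:
                completeVerification(tp);
                break;
            case CONCURRENT_TRANSACTIONS:
            case COORDINATOR_LOAD_IN_PROGRESS:
            case COORDINATOR_NOT_AVAILABLE:
                // Transient: re-enqueue the verification rather than surfacing an error
                // that shows up as an out-of-order sequence on the first produce.
                scheduleVerificationRetry(tp);
                break;
            default:
                failVerification(tp, error);
        }
    }

    abstract void completeVerification(TopicPartition tp);
    abstract void scheduleVerificationRetry(TopicPartition tp);
    abstract void failVerification(TopicPartition tp, Errors error);
}
{code}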



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-16217) Transactional producer stuck in IllegalStateException during close

2024-02-01 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813432#comment-17813432
 ] 

Justine Olshan edited comment on KAFKA-16217 at 2/2/24 1:36 AM:


Thanks [~calvinliu] for filing this bug. 
With [https://github.com/apache/kafka/pull/13591], if we are in the committing state 
and try to abort when closing, we catch the IllegalStateException. At 
this point, it is not possible to complete the transaction. Given that we can't 
move past this state, we should be able to close.

The server will either abort the transaction due to timeout, or the new 
producer coming up with the same producer ID will abort the transaction.


was (Author: jolshan):
Thanks [~calvinliu] for filing this bug. 
With [https://github.com/apache/kafka/pull/13591] if we are in committing state 
and try to abort when closing, we will hit an IllegalStateException and enter 
fatal state. At this point, it is not possible to complete the transaction. 
Given that the fatal state means the producer must be closed, we should allow 
closing the producer in fatal state.

The server will either abort the transaction due to timeout, or the new 
producer coming up with the same producer ID will abort the transaction.

> Transactional producer stuck in IllegalStateException during close
> --
>
> Key: KAFKA-16217
> URL: https://issues.apache.org/jira/browse/KAFKA-16217
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, producer 
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Calvin Liu
>Assignee: Kirk True
>Priority: Major
>  Labels: transactions
> Fix For: 3.6.2, 3.7.1
>
>
> The producer is stuck during the close. It keeps retrying to abort the 
> transaction but it never succeeds. 
> {code:java}
> [ERROR] 2024-02-01 17:21:22,804 [kafka-producer-network-thread | 
> producer-transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] 
> org.apache.kafka.clients.producer.internals.Sender run - [Producer 
> clientId=producer-transaction-ben
> ch-transaction-id-f60SGdyRQGGFjdgg3vUgKg, 
> transactionalId=transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] 
> Error in kafka producer I/O thread while aborting transaction:
> java.lang.IllegalStateException: Cannot attempt operation `abortTransaction` 
> because the previous call to `commitTransaction` timed out and must be retried
> at 
> org.apache.kafka.clients.producer.internals.TransactionManager.handleCachedTransactionRequestResult(TransactionManager.java:1138)
> at 
> org.apache.kafka.clients.producer.internals.TransactionManager.beginAbort(TransactionManager.java:323)
> at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:274)
> at java.base/java.lang.Thread.run(Thread.java:1583)
> at org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:66) 
> {code}
> With the additional log, I found the root cause. If the producer is in a bad 
> transaction state (in my case, TransactionManager.pendingTransition was 
> set to commitTransaction and did not get cleared) and the producer then calls 
> close and tries to abort the existing transaction, it will get 
> stuck aborting the transaction. It is related to the fix 
> [https://github.com/apache/kafka/pull/13591].
>  
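A hedged sketch of the client-side sequence that can reproduce the stuck state 
described here, assuming a transactional KafkaProducer; error handling is trimmed 
and the topic name is a placeholder:

{code:java}
import java.time.Duration;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.TimeoutException;

public class StuckCloseSketch {
    static void run(KafkaProducer<String, String> producer) {
        producer.initTransactions();
        producer.beginTransaction();
        producer.send(new ProducerRecord<>("test-topic", "value"));
        try {
            producer.commitTransaction();  // times out; pendingTransition stays commitTransaction
        } catch (TimeoutException e) {
            // commitTransaction should normally be retried, but the application shuts down instead.
        }
        // close() then tries to abort the transaction and repeatedly hits
        // "Cannot attempt operation `abortTransaction` because the previous call to
        // `commitTransaction` timed out and must be retried".
        producer.close(Duration.ofSeconds(5));
    }
}
{code}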



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16217) Transactional producer stuck in IllegalStateException during close

2024-02-01 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan updated KAFKA-16217:
---
Affects Version/s: 3.6.1
   3.7.0

> Transactional producer stuck in IllegalStateException during close
> --
>
> Key: KAFKA-16217
> URL: https://issues.apache.org/jira/browse/KAFKA-16217
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 3.7.0, 3.6.1
>Reporter: Calvin Liu
>Priority: Major
>
> The producer is stuck during the close. It keeps retrying to abort the 
> transaction but it never succeeds. 
> {code:java}
> [ERROR] 2024-02-01 17:21:22,804 [kafka-producer-network-thread | 
> producer-transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] 
> org.apache.kafka.clients.producer.internals.Sender run - [Producer 
> clientId=producer-transaction-ben
> ch-transaction-id-f60SGdyRQGGFjdgg3vUgKg, 
> transactionalId=transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] 
> Error in kafka producer I/O thread while aborting transaction:
> java.lang.IllegalStateException: Cannot attempt operation `abortTransaction` 
> because the previous call to `commitTransaction` timed out and must be retried
> at 
> org.apache.kafka.clients.producer.internals.TransactionManager.handleCachedTransactionRequestResult(TransactionManager.java:1138)
> at 
> org.apache.kafka.clients.producer.internals.TransactionManager.beginAbort(TransactionManager.java:323)
> at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:274)
> at java.base/java.lang.Thread.run(Thread.java:1583)
> at org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:66) 
> {code}
> With the additional log, I found the root cause. If the producer is in a bad 
> transaction state (in my case, TransactionManager.pendingTransition was 
> set to commitTransaction and did not get cleared) and the producer then calls 
> close and tries to abort the existing transaction, it will get 
> stuck aborting the transaction. It is related to the fix 
> [https://github.com/apache/kafka/pull/13591].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16217) Transactional producer stuck in IllegalStateException during close

2024-02-01 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813432#comment-17813432
 ] 

Justine Olshan commented on KAFKA-16217:


Thanks [~calvinliu] for filing this bug. 
With [https://github.com/apache/kafka/pull/13591] if we are in committing state 
and try to abort when closing, we will hit an IllegalStateException and enter 
fatal state. At this point, it is not possible to complete the transaction. 
Given that the fatal state means the producer must be closed, we should allow 
closing the producer in fatal state.

The server will either abort the transaction due to timeout, or the new 
producer coming up with the same producer ID will abort the transaction.

> Transactional producer stuck in IllegalStateException during close
> --
>
> Key: KAFKA-16217
> URL: https://issues.apache.org/jira/browse/KAFKA-16217
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Calvin Liu
>Priority: Major
>
> The producer is stuck during the close. It keeps retrying to abort the 
> transaction but it never succeeds. 
> {code:java}
> [ERROR] 2024-02-01 17:21:22,804 [kafka-producer-network-thread | 
> producer-transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] 
> org.apache.kafka.clients.producer.internals.Sender run - [Producer 
> clientId=producer-transaction-ben
> ch-transaction-id-f60SGdyRQGGFjdgg3vUgKg, 
> transactionalId=transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] 
> Error in kafka producer I/O thread while aborting transaction:
> java.lang.IllegalStateException: Cannot attempt operation `abortTransaction` 
> because the previous call to `commitTransaction` timed out and must be retried
> at 
> org.apache.kafka.clients.producer.internals.TransactionManager.handleCachedTransactionRequestResult(TransactionManager.java:1138)
> at 
> org.apache.kafka.clients.producer.internals.TransactionManager.beginAbort(TransactionManager.java:323)
> at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:274)
> at java.base/java.lang.Thread.run(Thread.java:1583)
> at org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:66) 
> {code}
> With the additional log, I found the root cause. If the producer is in a bad 
> transaction state (in my case, TransactionManager.pendingTransition was 
> set to commitTransaction and did not get cleared) and the producer then calls 
> close and tries to abort the existing transaction, it will get 
> stuck aborting the transaction. It is related to the fix 
> [https://github.com/apache/kafka/pull/13591].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16212) Cache partitions by TopicIdPartition instead of TopicPartition

2024-01-31 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812822#comment-17812822
 ] 

Justine Olshan commented on KAFKA-16212:


When working on KIP-516 (topic IDs) there were many areas that topic IDs could 
help but changing them all in one go would have been difficult. I'm happy to 
see more steps in this direction :) 

> Cache partitions by TopicIdPartition instead of TopicPartition
> --
>
> Key: KAFKA-16212
> URL: https://issues.apache.org/jira/browse/KAFKA-16212
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 3.7.0
>Reporter: Gaurav Narula
>Assignee: Omnia Ibrahim
>Priority: Major
>
> From the discussion in [PR 
> 15263|https://github.com/apache/kafka/pull/15263#discussion_r1471075201], it 
> would be better to cache {{allPartitions}} by {{TopicIdPartition}} instead of 
> {{TopicPartition}} to avoid ambiguity.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16204) Stray file core/00000000000000000001.snapshot created when running core tests

2024-01-29 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811983#comment-17811983
 ] 

Justine Olshan commented on KAFKA-16204:


I've seen it when running just build + checkstyle + spotbugs. 

I didn't know if it was just me.

> Stray file core/00000000000000000001.snapshot created when running core tests
> -
>
> Key: KAFKA-16204
> URL: https://issues.apache.org/jira/browse/KAFKA-16204
> Project: Kafka
>  Issue Type: Improvement
>  Components: core, unit tests
>Reporter: Mickael Maison
>Priority: Major
>
> When running the core tests I often get a file called 
> core/00000000000000000001.snapshot created in my kafka folder. It looks like 
> one of the tests does not clean up its resources properly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16122) TransactionsBounceTest -- server disconnected before response was received

2024-01-26 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan resolved KAFKA-16122.

Resolution: Fixed

> TransactionsBounceTest -- server disconnected before response was received
> --
>
> Key: KAFKA-16122
> URL: https://issues.apache.org/jira/browse/KAFKA-16122
> Project: Kafka
>  Issue Type: Test
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
>
> I noticed a ton of tests failing with 
> h4.  
> {code:java}
> Error  org.apache.kafka.common.KafkaException: Unexpected error in 
> TxnOffsetCommitResponse: The server disconnected before a response was 
> received.  {code}
> {code:java}
> Stacktrace  org.apache.kafka.common.KafkaException: Unexpected error in 
> TxnOffsetCommitResponse: The server disconnected before a response was 
> received.  at 
> app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnOffsetCommitHandler.handleResponse(TransactionManager.java:1702)
>   at 
> app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1236)
>   at 
> app//org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:154)
>   at 
> app//org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:608)
>   at app//org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:600) 
>  at 
> app//org.apache.kafka.clients.producer.internals.Sender.maybeSendAndPollTransactionalRequest(Sender.java:457)
>   at 
> app//org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:334)
>   at 
> app//org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:249)  
> at java.base@21.0.1/java.lang.Thread.run(Thread.java:1583) {code}
> The error indicates a network error which is retriable but the 
> TxnOffsetCommit handler doesn't expect this. 
> https://issues.apache.org/jira/browse/KAFKA-14417 addressed many of the other 
> requests but not this one. 
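A hedged sketch of what treating the disconnect as retriable could look like in 
a TxnOffsetCommit-style response handler; the method and helper names below are 
illustrative, not the actual TransactionManager code:

{code:java}
import org.apache.kafka.clients.ClientResponse;

// Illustrative handler: a disconnect (NETWORK_EXCEPTION) is retriable, so re-enqueue
// the request instead of surfacing "Unexpected error in TxnOffsetCommitResponse".
abstract class TxnOffsetCommitHandlerSketch {
    void onComplete(ClientResponse response) {
        if (response.wasDisconnected()) {
            reenqueue();  // retry later, looking up the coordinator again if needed
            return;
        }
        handleResponse(response.responseBody());
    }

    abstract void reenqueue();
    abstract void handleResponse(Object responseBody);
}
{code}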



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15987) Refactor ReplicaManager code for transaction verification

2024-01-26 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan resolved KAFKA-15987.

Resolution: Fixed

> Refactor ReplicaManager code for transaction verification
> -
>
> Key: KAFKA-15987
> URL: https://issues.apache.org/jira/browse/KAFKA-15987
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
>
> I started to do this in KAFKA-15784, but the diff was deemed too large and 
> confusing. I just wanted to file a followup ticket to reference this in code 
> for the areas that will be refactored.
>  
> I hope to tackle it immediately after.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16192) Introduce usage of flexible records to coordinators

2024-01-25 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan updated KAFKA-16192:
---
Description: 
[KIP-915|https://cwiki.apache.org/confluence/display/KAFKA/KIP-915%3A+Txn+and+Group+Coordinator+Downgrade+Foundation]
 introduced flexible versions to the records used for the group and transaction 
coordinators.
However, the KIP did not update the record version used.

For 
[KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense]
 we intend to use flexible fields in the transaction state records. This 
requires a safe way to upgrade from non-flexible version records to flexible 
version records.

This change is implemented via MV or through a new feature type.

  was:
[KIP-915|https://cwiki.apache.org/confluence/display/KAFKA/KIP-915%3A+Txn+and+Group+Coordinator+Downgrade+Foundation]
 introduced flexible versions to the records used for the group and transaction 
coordinators.
However, the KIP did not update the record version used.

For 
[KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense]
 we intend to use flexible fields in the transaction state records. This 
requires a safe way to upgrade from non-flexible version records to flexible 
version records.

This change is implemented via MV.


> Introduce usage of flexible records to coordinators
> ---
>
> Key: KAFKA-16192
> URL: https://issues.apache.org/jira/browse/KAFKA-16192
> Project: Kafka
>  Issue Type: Task
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
>
> [KIP-915|https://cwiki.apache.org/confluence/display/KAFKA/KIP-915%3A+Txn+and+Group+Coordinator+Downgrade+Foundation]
>  introduced flexible versions to the records used for the group and 
> transaction coordinators.
> However, the KIP did not update the record version used.
> For 
> [KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense]
>  we intend to use flexible fields in the transaction state records. This 
> requires a safe way to upgrade from non-flexible version records to flexible 
> version records.
> This change is implemented via MV or through a new feature type.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-16122) TransactionsBounceTest -- server disconnected before response was received

2024-01-25 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan reassigned KAFKA-16122:
--

Assignee: Justine Olshan

> TransactionsBounceTest -- server disconnected before response was received
> --
>
> Key: KAFKA-16122
> URL: https://issues.apache.org/jira/browse/KAFKA-16122
> Project: Kafka
>  Issue Type: Test
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
>
> I noticed a ton of tests failing with 
> h4.  
> {code:java}
> Error  org.apache.kafka.common.KafkaException: Unexpected error in 
> TxnOffsetCommitResponse: The server disconnected before a response was 
> received.  {code}
> {code:java}
> Stacktrace  org.apache.kafka.common.KafkaException: Unexpected error in 
> TxnOffsetCommitResponse: The server disconnected before a response was 
> received.  at 
> app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnOffsetCommitHandler.handleResponse(TransactionManager.java:1702)
>   at 
> app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1236)
>   at 
> app//org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:154)
>   at 
> app//org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:608)
>   at app//org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:600) 
>  at 
> app//org.apache.kafka.clients.producer.internals.Sender.maybeSendAndPollTransactionalRequest(Sender.java:457)
>   at 
> app//org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:334)
>   at 
> app//org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:249)  
> at java.base@21.0.1/java.lang.Thread.run(Thread.java:1583) {code}
> The error indicates a network error which is retriable but the 
> TxnOffsetCommit handler doesn't expect this. 
> https://issues.apache.org/jira/browse/KAFKA-14417 addressed many of the other 
> requests but not this one. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16192) Introduce usage of flexible records to coordinators

2024-01-24 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan updated KAFKA-16192:
---
Description: 
[KIP-915|https://cwiki.apache.org/confluence/display/KAFKA/KIP-915%3A+Txn+and+Group+Coordinator+Downgrade+Foundation]
 introduced flexible versions to the records used for the group and transaction 
coordinators.
However, the KIP did not update the record version used.

For 
[KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense]
 we intend to use flexible fields in the transaction state records. This 
requires a safe way to upgrade from non-flexible version records to flexible 
version records.

This change is implemented via MV.

  was:
[KIP-915| 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-915%3A+Txn+and+Group+Coordinator+Downgrade+Foundation]
 introduced flexible versions to the records used for the group and transaction 
coordinators.
However, the KIP did not update the record version used.

For 
[KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense]
 we intend to use flexible fields in the transaction state records. This 
requires a safe way to upgrade from non-flexible version records to flexible 
version records.

Typically this is done as a message format bump. There may be an option to make 
this change using MV, since the readers of the records are internal and not 
external consumers.


> Introduce usage of flexible records to coordinators
> ---
>
> Key: KAFKA-16192
> URL: https://issues.apache.org/jira/browse/KAFKA-16192
> Project: Kafka
>  Issue Type: Task
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
>
> [KIP-915|https://cwiki.apache.org/confluence/display/KAFKA/KIP-915%3A+Txn+and+Group+Coordinator+Downgrade+Foundation]
>  introduced flexible versions to the records used for the group and 
> transaction coordinators.
> However, the KIP did not update the record version used.
> For 
> [KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense]
>  we intend to use flexible fields in the transaction state records. This 
> requires a safe way to upgrade from non-flexible version records to flexible 
> version records.
> This change is implemented via MV.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16192) Introduce usage of flexible records to coordinators

2024-01-24 Thread Justine Olshan (Jira)
Justine Olshan created KAFKA-16192:
--

 Summary: Introduce usage of flexible records to coordinators
 Key: KAFKA-16192
 URL: https://issues.apache.org/jira/browse/KAFKA-16192
 Project: Kafka
  Issue Type: Task
Reporter: Justine Olshan
Assignee: Justine Olshan


[KIP-915| 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-915%3A+Txn+and+Group+Coordinator+Downgrade+Foundation]
 introduced flexible versions to the records used for the group and transaction 
coordinators.
However, the KIP did not update the record version used.

For 
[KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense]
 we intend to use flexible fields in the transaction state records. This 
requires a safe way to upgrade from non-flexible version records to flexible 
version records.

Typically this is done as a message format bump. There may be an option to make 
this change using MV, since the readers of the records are internal and not 
external consumers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15104) Flaky test MetadataQuorumCommandTest for method testDescribeQuorumReplicationSuccessful

2024-01-22 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17809648#comment-17809648
 ] 

Justine Olshan commented on KAFKA-15104:


I saw this fail many times here:  
[https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15183/3/tests]
 

> Flaky test MetadataQuorumCommandTest for method 
> testDescribeQuorumReplicationSuccessful
> ---
>
> Key: KAFKA-15104
> URL: https://issues.apache.org/jira/browse/KAFKA-15104
> Project: Kafka
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 3.5.0
>Reporter: Josep Prat
>Priority: Major
>  Labels: flaky-test
>
> The MetadataQuorumCommandTest has become flaky on CI, I saw this failing: 
> org.apache.kafka.tools.MetadataQuorumCommandTest.[1] Type=Raft-Combined, 
> Name=testDescribeQuorumReplicationSuccessful, MetadataVersion=3.6-IV0, 
> Security=PLAINTEXT
> Link to the CI: 
> https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-13865/2/testReport/junit/org.apache.kafka.tools/MetadataQuorumCommandTest/Build___JDK_8_and_Scala_2_121__Type_Raft_Combined__Name_testDescribeQuorumReplicationSuccessful__MetadataVersion_3_6_IV0__Security_PLAINTEXT/
>  
> h3. Error Message
> {code:java}
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: Received 
> a fatal error while waiting for the controller to acknowledge that we are 
> caught up{code}
> h3. Stacktrace
> {code:java}
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: Received 
> a fatal error while waiting for the controller to acknowledge that we are 
> caught up at java.util.concurrent.FutureTask.report(FutureTask.java:122) at 
> java.util.concurrent.FutureTask.get(FutureTask.java:192) at 
> kafka.testkit.KafkaClusterTestKit.startup(KafkaClusterTestKit.java:419) at 
> kafka.test.junit.RaftClusterInvocationContext.lambda$getAdditionalExtensions$5(RaftClusterInvocationContext.java:115)
>  at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeBeforeTestExecutionCallbacks$5(TestMethodTestDescriptor.java:191)
>  at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeBeforeMethodsOrCallbacksUntilExceptionOccurs$6(TestMethodTestDescriptor.java:202)
>  at 
> org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
>  at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeBeforeMethodsOrCallbacksUntilExceptionOccurs(TestMethodTestDescriptor.java:202)
>  at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeBeforeTestExecutionCallbacks(TestMethodTestDescriptor.java:190)
>  at 
> org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:136){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16156) System test failing for new consumer on endOffsets with negative timestamps

2024-01-17 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807858#comment-17807858
 ] 

Justine Olshan commented on KAFKA-16156:


The transactional copier is pretty old. We can probably update it. If I get 
time I can take a look.
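For reference, a hedged sketch of the kind of guard this failure points at: end 
offsets come back with timestamp -1, and wrapping them in OffsetAndTimestamp 
throws "Invalid negative timestamp". The method and variable names are 
illustrative, not the actual OffsetFetcherUtils code:

{code:java}
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

class EndOffsetGuardSketch {
    Map<TopicPartition, OffsetAndTimestamp> buildResult(TopicPartition tp, long offset, long timestamp) {
        Map<TopicPartition, OffsetAndTimestamp> result = new HashMap<>();
        if (timestamp >= 0) {
            result.put(tp, new OffsetAndTimestamp(offset, timestamp));
        } else {
            // endOffsets()/beginningOffsets() only need the raw offset; a -1 timestamp
            // must not be passed to OffsetAndTimestamp, which rejects negative values.
            result.put(tp, null);
        }
        return result;
    }
}
{code}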

> System test failing for new consumer on endOffsets with negative timestamps
> ---
>
> Key: KAFKA-16156
> URL: https://issues.apache.org/jira/browse/KAFKA-16156
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients
>Reporter: Lianet Magrans
>Priority: Major
>
> TransactionalMessageCopier run with 3.7 new consumer fails with "Invalid 
> negative timestamp".
> Trace:
> [2024-01-15 07:42:33,932] TRACE [Consumer 
> clientId=consumer-transactions-test-consumer-group-1, 
> groupId=transactions-test-consumer-group] Received ListOffsetResponse 
> ListOffsetsResponseData(throttleTimeMs=0, 
> topics=[ListOffsetsTopicResponse(name='input-topic', 
> partitions=[ListOffsetsPartitionResponse(partitionIndex=0, errorCode=0, 
> oldStyleOffsets=[], timestamp=-1, offset=42804, leaderEpoch=0)])]) from 
> broker worker2:9092 (id: 2 rack: null) 
> (org.apache.kafka.clients.consumer.internals.OffsetsRequestManager)
> [2024-01-15 07:42:33,932] DEBUG [Consumer 
> clientId=consumer-transactions-test-consumer-group-1, 
> groupId=transactions-test-consumer-group] Handling ListOffsetResponse 
> response for input-topic-0. Fetched offset 42804, timestamp -1 
> (org.apache.kafka.clients.consumer.internals.OffsetFetcherUtils)
> [2024-01-15 07:42:33,932] TRACE [Consumer 
> clientId=consumer-transactions-test-consumer-group-1, 
> groupId=transactions-test-consumer-group] Updating last stable offset for 
> partition input-topic-0 to 42804 
> (org.apache.kafka.clients.consumer.internals.OffsetFetcherUtils)
> [2024-01-15 07:42:33,933] DEBUG [Consumer 
> clientId=consumer-transactions-test-consumer-group-1, 
> groupId=transactions-test-consumer-group] Fetch offsets completed 
> successfully for partitions and timestamps {input-topic-0=-1}. Result 
> org.apache.kafka.clients.consumer.internals.OffsetFetcherUtils$ListOffsetResult@bf2a862
>  (org.apache.kafka.clients.consumer.internals.OffsetsRequestManager)
> [2024-01-15 07:42:33,933] TRACE [Consumer 
> clientId=consumer-transactions-test-consumer-group-1, 
> groupId=transactions-test-consumer-group] No events to process 
> (org.apache.kafka.clients.consumer.internals.events.EventProcessor)
> [2024-01-15 07:42:33,933] ERROR Shutting down after unexpected error in event 
> loop (org.apache.kafka.tools.TransactionalMessageCopier)
> org.apache.kafka.common.KafkaException: java.lang.IllegalArgumentException: 
> Invalid negative timestamp
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerUtils.maybeWrapAsKafkaException(ConsumerUtils.java:234)
>   at 
> org.apache.kafka.clients.consumer.internals.ConsumerUtils.getResult(ConsumerUtils.java:212)
>   at 
> org.apache.kafka.clients.consumer.internals.events.CompletableApplicationEvent.get(CompletableApplicationEvent.java:44)
>   at 
> org.apache.kafka.clients.consumer.internals.events.ApplicationEventHandler.addAndGet(ApplicationEventHandler.java:113)
>   at 
> org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.beginningOrEndOffset(AsyncKafkaConsumer.java:1134)
>   at 
> org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.endOffsets(AsyncKafkaConsumer.java:1113)
>   at 
> org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.endOffsets(AsyncKafkaConsumer.java:1108)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.endOffsets(KafkaConsumer.java:1651)
>   at 
> org.apache.kafka.tools.TransactionalMessageCopier.messagesRemaining(TransactionalMessageCopier.java:246)
>   at 
> org.apache.kafka.tools.TransactionalMessageCopier.runEventLoop(TransactionalMessageCopier.java:342)
>   at 
> org.apache.kafka.tools.TransactionalMessageCopier.main(TransactionalMessageCopier.java:292)
> Caused by: java.lang.IllegalArgumentException: Invalid negative timestamp
>   at 
> org.apache.kafka.clients.consumer.OffsetAndTimestamp.(OffsetAndTimestamp.java:39)
>   at 
> org.apache.kafka.clients.consumer.internals.OffsetFetcherUtils.buildOffsetsForTimesResult(OffsetFetcherUtils.java:253)
>   at 
> org.apache.kafka.clients.consumer.internals.OffsetsRequestManager.lambda$fetchOffsets$1(OffsetsRequestManager.java:181)
>   at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
>   at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>   at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>   at 
> 

[jira] [Created] (KAFKA-16122) TransactionsBounceTest -- server disconnected before response was received

2024-01-12 Thread Justine Olshan (Jira)
Justine Olshan created KAFKA-16122:
--

 Summary: TransactionsBounceTest -- server disconnected before 
response was received
 Key: KAFKA-16122
 URL: https://issues.apache.org/jira/browse/KAFKA-16122
 Project: Kafka
  Issue Type: Test
Reporter: Justine Olshan


I noticed a ton of tests failing with 


h4.  
{code:java}
Error  org.apache.kafka.common.KafkaException: Unexpected error in 
TxnOffsetCommitResponse: The server disconnected before a response was 
received.  {code}
{code:java}
Stacktrace  org.apache.kafka.common.KafkaException: Unexpected error in 
TxnOffsetCommitResponse: The server disconnected before a response was 
received.  at 
app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnOffsetCommitHandler.handleResponse(TransactionManager.java:1702)
  at 
app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1236)
  at 
app//org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:154)
  at 
app//org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:608)
  at app//org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:600)  
at 
app//org.apache.kafka.clients.producer.internals.Sender.maybeSendAndPollTransactionalRequest(Sender.java:457)
  at 
app//org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:334)
  at 
app//org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:249)  
at java.base@21.0.1/java.lang.Thread.run(Thread.java:1583) {code}


The error indicates a network error, which is retriable, but the TxnOffsetCommit 
handler doesn't expect it. 

https://issues.apache.org/jira/browse/KAFKA-14417 addressed many of the other 
requests but not this one. 
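
As a rough illustration of the fix direction (a sketch only, not the producer's 
actual TransactionManager code; only the error names are taken from the log above), 
the handler would need to classify a disconnect/NETWORK_EXCEPTION as retriable 
rather than surfacing it as an unexpected fatal error:

{code:scala}
object TxnOffsetCommitErrorSketch extends App {
  sealed trait Outcome
  case object Retry extends Outcome
  case object Completed extends Outcome
  final case class Fatal(message: String) extends Outcome

  // Hypothetical classification: a NETWORK_EXCEPTION becomes a retry instead of
  // bubbling up as "Unexpected error in TxnOffsetCommitResponse".
  def classify(error: String): Outcome = error match {
    case "NONE"              => Completed
    case "NETWORK_EXCEPTION" => Retry
    case other               => Fatal(s"Unexpected error in TxnOffsetCommitResponse: $other")
  }

  println(classify("NETWORK_EXCEPTION")) // Retry
}
{code}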



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16120) Partition reassignments in ZK migration dual write leaves stray partitions

2024-01-12 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806129#comment-17806129
 ] 

Justine Olshan commented on KAFKA-16120:


Do you know if https://issues.apache.org/jira/browse/KAFKA-14616 is related?

> Partition reassignments in ZK migration dual write leaves stray partitions
> --
>
> Key: KAFKA-16120
> URL: https://issues.apache.org/jira/browse/KAFKA-16120
> Project: Kafka
>  Issue Type: Bug
>Reporter: David Mao
>Priority: Major
>
> When a reassignment is completed in ZK migration dual-write mode, the 
> `StopReplica` sent by the kraft quorum migration propagator is sent with 
> `delete = false` for deleted replicas when processing the topic delta. This 
> results in stray replicas.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-16120) Partition reassignments in ZK migration dual write leaves stray partitions

2024-01-12 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806129#comment-17806129
 ] 

Justine Olshan edited comment on KAFKA-16120 at 1/12/24 5:18 PM:
-

Do you know if https://issues.apache.org/jira/browse/KAFKA-14616 is related?

 

Oh – I see this one is specifically for migration, and I believe the other one is not.


was (Author: jolshan):
Do you know if https://issues.apache.org/jira/browse/KAFKA-14616 is related?

> Partition reassignments in ZK migration dual write leaves stray partitions
> --
>
> Key: KAFKA-16120
> URL: https://issues.apache.org/jira/browse/KAFKA-16120
> Project: Kafka
>  Issue Type: Bug
>Reporter: David Mao
>Priority: Major
>
> When a reassignment is completed in ZK migration dual-write mode, the 
> `StopReplica` sent by the kraft quorum migration propagator is sent with 
> `delete = false` for deleted replicas when processing the topic delta. This 
> results in stray replicas.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16078) InterBrokerProtocolVersion defaults to non-production MetadataVersion

2024-01-03 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802308#comment-17802308
 ] 

Justine Olshan commented on KAFKA-16078:


We should really get to writing the KIP for the "non-production" vs 
"production" MVs.

> InterBrokerProtocolVersion defaults to non-production MetadataVersion
> -
>
> Key: KAFKA-16078
> URL: https://issues.apache.org/jira/browse/KAFKA-16078
> Project: Kafka
>  Issue Type: Bug
>Reporter: David Arthur
>Assignee: David Arthur
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-28 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801029#comment-17801029
 ] 

Justine Olshan commented on KAFKA-16052:


Hey Luke – thanks for your help with the non-mocked replica manager. Did you 
mean to leave this comment about detecting leaking threads on a different Jira? 
I think it is unrelated to the mock replica manager, right?

I will take a look at your and Divij's PRs to fix the replica manager.

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot 
> 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png, Screenshot 
> 2023-12-28 at 11.26.03.png, Screenshot 2023-12-28 at 11.26.09.png, newRM.patch
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )
> -  maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false
>    userMaxTestRetries = project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800911#comment-17800911
 ] 

Justine Olshan commented on KAFKA-16052:


What do we think about lowering the number of sequences we test (say from 10 to 
3 or 4) and the number of groups from 10 to 5 or so? 
That would lower the number of operations from 35,000 to 7,000. This is used in 
both the GroupCoordinator and the TransactionCoordinator tests. (The 
transaction coordinator doesn't have groups, but it does have transactions.) We 
could also look at some of the other tests in the suite. (To be honest, I never 
really understood how this suite was actually testing concurrency with all the 
mocks and redefined operations.)

We probably need to figure out why we store all of these mock invocations since 
we are just executing the methods anyway. But I think lowering the number of 
operations should give us some breathing room for the OOMs. 

What do you think [~divijvaidya] [~ijuma] 
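
For reference, a quick back-of-the-envelope check of those numbers (a sketch only; 
the counts come from the comments on this ticket, and the "4 sequences of each 
kind, 5 groups" assumption below is just one combination that matches the quoted 
7,000, not re-derived from the test source):

{code:scala}
object ConcurrencyTestOpCount extends App {
  val opsPerMember = 7

  // Current test: 10 random + 10 ordered sequences, 250 members (nthreads(5) * 5 * 10 groups)
  val currentOps = (10 + 10) * opsPerMember * (5 * 5 * 10)
  println(currentOps) // 35000

  // Proposed reduction: roughly 4 sequences of each kind and 5 groups instead of 10
  val reducedOps = (4 + 4) * opsPerMember * (5 * 5 * 5)
  println(reducedOps) // 7000
}
{code}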

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot 
> 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800909#comment-17800909
 ] 

Justine Olshan edited comment on KAFKA-16052 at 12/28/23 2:35 AM:
--

So I realized that every test runs thousands of operations that call not only 
replicaManager.maybeStartTransactionVerificationForPartition but also other 
replicaManager methods like replicaManager.appendForGroup, thousands of times. 

I'm trying to figure out why this is causing regressions now. 

For the testConcurrentRandomSequence test, we call these methods approximately 
35,000 times.

We do 10 random sequences and 10 ordered sequences. Each sequence consists of 7 
operations for the 250 group members (nthreads(5) * 5 * 10). 
So 20 * 7 * 250 is 35,000. Not sure if we need this many operations!


was (Author: jolshan):
So I realized that every test runs thousands of operations that call not only 
replicaManager.maybeStartTransactionVerificationForPartition but also other 
replicaManager methods like replicaManager appendForGroup thousands of times. 

I'm trying to figure out why this is causing regressions now. 

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot 
> 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800909#comment-17800909
 ] 

Justine Olshan commented on KAFKA-16052:


So I realized that every test runs thousands of operations that call not only 
replicaManager.maybeStartTransactionVerificationForPartition but also other 
replicaManager methods like replicaManager appendForGroup thousands of times. 

I'm trying to figure out why this is causing regressions now. 

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot 
> 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800903#comment-17800903
 ] 

Justine Olshan edited comment on KAFKA-16052 at 12/28/23 2:05 AM:
--

It looks like the method that is actually being called is 
replicaManager.maybeStartTransactionVerificationForPartition.

This is overridden in TestReplicaManager to just execute the callback. I'm not sure 
why all of these invocations are being stored, though. I will look into that next.


was (Author: jolshan):
It looks like the method that is actually being called is – 
replicaManager.maybeStartTransactionVerificationForPartition there.

This is overridden to just do the callback. I'm not sure why it is storing all 
of these though. I will look into that next.

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot 
> 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800903#comment-17800903
 ] 

Justine Olshan commented on KAFKA-16052:


It looks like the method that is actually being called is – 
replicaManager.maybeStartTransactionVerificationForPartition there.

This is overridden to just do the callback. I'm not sure why it is storing all 
of these though. I will look into that next.

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot 
> 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800902#comment-17800902
 ] 

Justine Olshan commented on KAFKA-16052:


Thanks Divij. This might be a few things then, since handleTxnCommitOffsets 
is not in the TransactionCoordinator test. 
I will look into this, though. 

I believe the OOMs started before I changed the transactional offset commit 
flow, but perhaps it is still related to something else I changed.

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot 
> 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800880#comment-17800880
 ] 

Justine Olshan commented on KAFKA-16052:


There are only four tests in the group coordinator suite, so it is surprising 
to hit OOM on those. But trying clearInlineMocks should be low-hanging 
fruit that could help a bit while we figure out why the mocks are causing so 
many issues in the first place.

I have a draft PR with the change and we can look at how the builds do: 
https://github.com/apache/kafka/pull/15081
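
For context, the low-hanging-fruit change is roughly the following (a sketch, 
assuming the suite uses Mockito's inline mock maker; the real change is whatever 
landed in the PR above):

{code:scala}
import org.junit.jupiter.api.AfterEach
import org.mockito.Mockito

abstract class CoordinatorConcurrencyTestCleanup {
  // The inline mock maker keeps strong references to created mocks and their
  // recorded invocations; clearing them between tests bounds the heap growth
  // while the underlying cause of the retained invocations is investigated.
  @AfterEach
  def cleanUpMocks(): Unit = {
    Mockito.framework().clearInlineMocks()
  }
}
{code}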

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800877#comment-17800877
 ] 

Justine Olshan commented on KAFKA-16052:


Ah – good point [~ijuma]. I will look into that as well.

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800865#comment-17800865
 ] 

Justine Olshan commented on KAFKA-16052:


Taking a look at removing the mock as well. There are quite a few parts of the 
initialization that we don't need to do for the tests. 

For example:


val replicaFetcherManager = createReplicaFetcherManager(metrics, time, 
threadNamePrefix, quotaManagers.follower)
private[server] val replicaAlterLogDirsManager = 
createReplicaAlterLogDirsManager(quotaManagers.alterLogDirs, brokerTopicStats)

We can do these if necessary, but they also start other threads that we probably 
don't want running for the test.
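
To make the idea concrete, here is a minimal, self-contained sketch of the pattern 
(the class and method names below are invented for illustration and are not 
Kafka's actual ReplicaManager API): the thread-spawning dependency is created 
through an overridable factory method, and the test subclass overrides it so 
nothing is started.

{code:scala}
class ManagerWithBackgroundWork {
  // Factory method for the background worker; production code starts a real thread.
  protected def createFetcherThread(): Thread = {
    val t = new Thread(() => while (true) Thread.sleep(1000))
    t.setDaemon(true)
    t.start()
    t
  }

  val fetcher: Thread = createFetcherThread()
}

// Test-only subclass: the override returns an idle, unstarted thread, so the test
// never pays for (or leaks) the background work.
class TestManager extends ManagerWithBackgroundWork {
  override protected def createFetcherThread(): Thread = new Thread(() => ())
}
{code}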

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800855#comment-17800855
 ] 

Justine Olshan edited comment on KAFKA-16052 at 12/27/23 7:20 PM:
--

Doing a quick scan at the InterceptedInvocations, I see many with the method 
numDelayedDelete which is a method on DelayedOperations. I will look into that 
avenue for a bit.


was (Author: jolshan):
Doing a quick scan at the InterceptedInvocations, I see many with the method 
numDelayedDelete which is a method on DelayedOperations. I will look into that 
avenue for a bit.

EDIT: It doesn't seem like we mock DelayedOperations in these tests, so not 
sure if this is something else.

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800855#comment-17800855
 ] 

Justine Olshan edited comment on KAFKA-16052 at 12/27/23 7:19 PM:
--

Doing a quick scan at the InterceptedInvocations, I see many with the method 
numDelayedDelete which is a method on DelayedOperations. I will look into that 
avenue for a bit.

EDIT: It doesn't seem like we mock DelayedOperations in these tests, so not 
sure if this is something else.


was (Author: jolshan):
Doing a quick scan at the InterceptedInvocations, I see many with the method 
numDelayedDelete which is a method on DelayedOperations. I will look into that 
avenue for a bit.

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800855#comment-17800855
 ] 

Justine Olshan commented on KAFKA-16052:


Doing a quick scan at the InterceptedInvocations, I see many with the method 
numDelayedDelete which is a method on DelayedOperations. I will look into that 
avenue for a bit.

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800852#comment-17800852
 ] 

Justine Olshan commented on KAFKA-16052:


I can also take a look at the heap dump if that is helpful.

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite

2023-12-27 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800851#comment-17800851
 ] 

Justine Olshan commented on KAFKA-16052:


Thanks Divij for the digging. These tests share the common 
AbstractCoordinatorConcurrencyTest. I wonder if the mock in question is there.

We do mock the replica manager, but in an interesting way:

replicaManager = mock(classOf[TestReplicaManager], 
withSettings().defaultAnswer(CALLS_REAL_METHODS))
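
One detail worth noting (an aside, not something already established on this 
ticket): even with CALLS_REAL_METHODS, Mockito still intercepts and records every 
invocation on the mock by default, which would line up with the 
InterceptedInvocations build-up in the heap dump. A stub-only mock (or a plain 
`new` instance) skips that bookkeeping. A small sketch, with a placeholder class 
standing in for the real TestReplicaManager:

{code:scala}
import org.mockito.Mockito.{mock, withSettings, CALLS_REAL_METHODS}

class FakeReplicaManager {
  def numDelayedDelete: Int = 0
}

object StubOnlySketch extends App {
  // stubOnly() tells Mockito not to retain invocations for later verification,
  // which keeps repeated calls from accumulating on the heap.
  val rm = mock(classOf[FakeReplicaManager],
    withSettings().defaultAnswer(CALLS_REAL_METHODS).stubOnly())

  (1 to 100000).foreach(_ => rm.numDelayedDelete)
}
{code}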

> OOM in Kafka test suite
> ---
>
> Key: KAFKA-16052
> URL: https://issues.apache.org/jira/browse/KAFKA-16052
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.7.0
>Reporter: Divij Vaidya
>Priority: Major
> Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 
> 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 
> 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png
>
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list 
> is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4] 
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single 
> thread (see below on how to do it) and attached a profiler to it. This Jira 
> tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>        "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>      )-  maxTestForks = project.hasProperty('maxParallelForks') ? 
> maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? 
> maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? 
> ignoreFailures : false   userMaxTestRetries = 
> project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false {code}
> *Result of experiment*
> This is how the heap memory utilized looks like, starting from tens of MB to 
> ending with 1.5GB (with spikes of 2GB) of heap being used as the test 
> executes. Note that the total number of threads also increases but it does 
> not correlate with sharp increase in heap memory usage. The heap dump is 
> available at 
> [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln=0]
>  
> !Screenshot 2023-12-27 at 14.22.21.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-16012) Incomplete range assignment in consumer

2023-12-22 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan resolved KAFKA-16012.

Resolution: Fixed

> Incomplete range assignment in consumer
> ---
>
> Key: KAFKA-16012
> URL: https://issues.apache.org/jira/browse/KAFKA-16012
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Philip Nee
>Priority: Blocker
> Fix For: 3.7.0
>
>
> We were looking into test failures here: 
> https://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/system-test-kafka-branch-builder--1702475525--jolshan--kafka-15784--7cad567675/2023-12-13--001./2023-12-13–001./report.html.
>  
> Here is the first failure in the report:
> {code:java}
> 
> test_id:    
> kafkatest.tests.core.group_mode_transactions_test.GroupModeTransactionsTest.test_transactions.failure_mode=clean_bounce.bounce_target=brokers
> status:     FAIL
> run time:   3 minutes 4.950 seconds
>     TimeoutError('Consumer consumed only 88223 out of 10 messages in 
> 90s') {code}
>  
> We traced the failure to an apparent bug during the last rebalance before the 
> group became empty. The last remaining instance seems to receive an 
> incomplete assignment which prevents it from completing expected consumption 
> on some partitions. Here is the rebalance from the coordinator's perspective:
> {code:java}
> server.log.2023-12-13-04:[2023-12-13 04:58:56,987] INFO [GroupCoordinator 3]: 
> Stabilized group grouped-transactions-test-consumer-group generation 5 
> (__consumer_offsets-2) with 1 members 
> (kafka.coordinator.group.GroupCoordinator)
> server.log.2023-12-13-04:[2023-12-13 04:58:56,990] INFO [GroupCoordinator 3]: 
> Assignment received from leader 
> consumer-grouped-transactions-test-consumer-group-1-2164f472-93f3-4176-af3f-23d4ed8b37fd
>  for group grouped-transactions-test-consumer-group for generation 5. The 
> group has 1 members, 0 of which are static. 
> (kafka.coordinator.group.GroupCoordinator) {code}
> The group is down to one member in generation 5. In the previous generation, 
> the consumer in question reported this assignment:
> {code:java}
> // Gen 4: we've got partitions 0-4
> [2023-12-13 04:58:52,631] DEBUG [Consumer 
> clientId=consumer-grouped-transactions-test-consumer-group-1, 
> groupId=grouped-transactions-test-consumer-group] Executing onJoinComplete 
> with generation 4 and memberId 
> consumer-grouped-transactions-test-consumer-group-1-2164f472-93f3-4176-af3f-23d4ed8b37fd
>  (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
> [2023-12-13 04:58:52,631] INFO [Consumer 
> clientId=consumer-grouped-transactions-test-consumer-group-1, 
> groupId=grouped-transactions-test-consumer-group] Notifying assignor about 
> the new Assignment(partitions=[input-topic-0, input-topic-1, input-topic-2, 
> input-topic-3, input-topic-4]) 
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) {code}
> However, in generation 5, we seem to be assigned only one partition:
> {code:java}
> // Gen 5: Now we have only partition 1? But aren't we the last member in the 
> group?
> [2023-12-13 04:58:56,954] DEBUG [Consumer 
> clientId=consumer-grouped-transactions-test-consumer-group-1, 
> groupId=grouped-transactions-test-consumer-group] Executing onJoinComplete 
> with generation 5 and memberId 
> consumer-grouped-transactions-test-consumer-group-1-2164f472-93f3-4176-af3f-23d4ed8b37fd
>  (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
> [2023-12-13 04:58:56,955] INFO [Consumer 
> clientId=consumer-grouped-transactions-test-consumer-group-1, 
> groupId=grouped-transactions-test-consumer-group] Notifying assignor about 
> the new Assignment(partitions=[input-topic-1]) 
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) {code}
> The assignment type from the JoinGroup for generation 5 is range. The decoded 
> metadata sent by the consumer is:
> {code:java}
> Subscription(topics=[input-topic], ownedPartitions=[], groupInstanceId=null, 
> generationId=4, rackId=null) {code}
> Here is the decoded assignment from the SyncGroup:
> {code:java}
> Assignment(partitions=[input-topic-1]) {code}
>  
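
For context, a rough illustration of what range assignment should produce here (a hypothetical standalone sketch, not the actual RangeAssignor code): with a single remaining member subscribed to one five-partition topic, a range-style assignor should hand that member all five partitions, which is what makes the single-partition assignment in generation 5 above look wrong.

{code:java}
import java.util.ArrayList;
import java.util.List;

public class RangeAssignmentSketch {
    // Hypothetical helper: range-assign numPartitions of `topic` across `members` consumers.
    static List<List<String>> rangeAssign(String topic, int numPartitions, int members) {
        List<List<String>> assignment = new ArrayList<>();
        int base = numPartitions / members;   // partitions every member gets
        int extra = numPartitions % members;  // the first `extra` members get one more
        int next = 0;
        for (int m = 0; m < members; m++) {
            int count = base + (m < extra ? 1 : 0);
            List<String> owned = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                owned.add(topic + "-" + next++);
            }
            assignment.add(owned);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Generation 5 has a single member, so it should own input-topic-0..4,
        // not just input-topic-1 as seen in the logs above.
        System.out.println(rangeAssign("input-topic", 5, 1));
        // -> [[input-topic-0, input-topic-1, input-topic-2, input-topic-3, input-topic-4]]
    }
}
{code}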



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16045) ZkMigrationIntegrationTest.testMigrateTopicDeletion flaky

2023-12-21 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan updated KAFKA-16045:
---
Labels: flaky-test  (was: )

> ZkMigrationIntegrationTest.testMigrateTopicDeletion flaky
> -
>
> Key: KAFKA-16045
> URL: https://issues.apache.org/jira/browse/KAFKA-16045
> Project: Kafka
>  Issue Type: Test
>Reporter: Justine Olshan
>Priority: Major
>  Labels: flaky-test
>
> I'm seeing ZkMigrationIntegrationTest.testMigrateTopicDeletion fail in many 
> builds. I believe it is also causing a thread leak: on most runs where it 
> fails, the ReplicaManager tests also fail with extra threads. 
> The test always fails with 
> `org.opentest4j.AssertionFailedError: Timed out waiting for topics to be 
> deleted`
> gradle enterprise link:
> [https://ge.apache.org/scans/tests?search.names=Git%20branch[…]lues=trunk=kafka.zk.ZkMigrationIntegrationTest|https://ge.apache.org/scans/tests?search.names=Git%20branch=P28D=kafka=America%2FLos_Angeles=trunk=kafka.zk.ZkMigrationIntegrationTest]
> recent pr: 
> [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15023/18/tests/]
> trunk builds: 
> [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2502/tests],
>  
> [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2501/tests]
>  (edited) 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16045) ZkMigrationIntegrationTest.testMigrateTopicDeletion flaky

2023-12-21 Thread Justine Olshan (Jira)
Justine Olshan created KAFKA-16045:
--

 Summary: ZkMigrationIntegrationTest.testMigrateTopicDeletion flaky
 Key: KAFKA-16045
 URL: https://issues.apache.org/jira/browse/KAFKA-16045
 Project: Kafka
  Issue Type: Test
Reporter: Justine Olshan


I'm seeing ZkMigrationIntegrationTest.testMigrateTopicDeletion fail in many 
builds. I believe it is also causing a thread leak: on most runs where it 
fails, the ReplicaManager tests also fail with extra threads. 

The test always fails with 
`org.opentest4j.AssertionFailedError: Timed out waiting for topics to be 
deleted`


gradle enterprise link:

[https://ge.apache.org/scans/tests?search.names=Git%20branch[…]lues=trunk=kafka.zk.ZkMigrationIntegrationTest|https://ge.apache.org/scans/tests?search.names=Git%20branch=P28D=kafka=America%2FLos_Angeles=trunk=kafka.zk.ZkMigrationIntegrationTest]

recent pr: 
[https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15023/18/tests/]
trunk builds: 
[https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2502/tests],
 
[https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2501/tests]
 (edited) 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-12670) KRaft support for unclean.leader.election.enable

2023-12-18 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798270#comment-17798270
 ] 

Justine Olshan commented on KAFKA-12670:


This ticket is a little stale. KRaft does support manual unclean leader 
election, but not automatic election through the config.

Here is the reply I also shared on the mailing list:

When I asked some of the folks who worked on KRaft about this, they 
told me that making unclean leader election a manual action was intentional.

I think that for folks who want to prioritize availability over durability, 
the aggressive recovery strategy from KIP-966 should be preferable to the old 
unclean leader election configuration. 
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas#KIP966:EligibleLeaderReplicas-Uncleanrecovery]

Let me know if we don't think this is sufficient.
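
For anyone landing here looking for the manual path, here is a minimal sketch of triggering an unclean election through the Admin client (the bootstrap address and topic-partition below are placeholders); the kafka-leader-election.sh tool with --election-type UNCLEAN is the command-line equivalent.

{code:java}
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.TopicPartition;

public class UncleanElectionSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            // Ask the controller to elect a new leader for the partition, allowing an
            // out-of-sync replica to be chosen (the "unclean" part).
            admin.electLeaders(ElectionType.UNCLEAN,
                    Collections.singleton(new TopicPartition("my-topic", 0)))
                 .partitions()
                 .get();
        }
    }
}
{code}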

> KRaft support for unclean.leader.election.enable
> 
>
> Key: KAFKA-12670
> URL: https://issues.apache.org/jira/browse/KAFKA-12670
> Project: Kafka
>  Issue Type: Task
>Reporter: Colin McCabe
>Assignee: Ryan Dielhenn
>Priority: Major
>
> Implement KRaft support for the unclean.leader.election.enable 
> configurations.  These configurations can be set at the topic, broker, or 
> cluster level.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15784) Ensure atomicity of in memory update and write when transactionally committing offsets

2023-12-14 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan resolved KAFKA-15784.

Resolution: Fixed

> Ensure atomicity of in memory update and write when transactionally 
> committing offsets
> --
>
> Key: KAFKA-15784
> URL: https://issues.apache.org/jira/browse/KAFKA-15784
> Project: Kafka
>  Issue Type: Sub-task
>Affects Versions: 3.7.0
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Blocker
>
> [https://github.com/apache/kafka/pull/14370] (KAFKA-15449) removed the 
> locking around validating, updating state, and writing transactional offset 
> commits to the log. (The verification causes us to release the lock.)
> This was discovered in the discussion of 
> [https://github.com/apache/kafka/pull/14629] (KAFKA-15653).
> Since KAFKA-15653 is needed for 3.6.1, it makes sense to address the locking 
> issue separately with this ticket. 
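
Conceptually, the fix is to hold the group lock across all three steps again. A minimal sketch of that shape, using illustrative names rather than the real GroupMetadataManager code:

{code:java}
import java.util.concurrent.locks.ReentrantLock;

public class AtomicOffsetCommitSketch {
    private final ReentrantLock groupLock = new ReentrantLock();

    // Illustrative only: the three phases are passed in as runnables so the sketch
    // stays self-contained; the point is the single lock acquisition around them.
    void commitTransactionalOffsets(Runnable validate, Runnable updateInMemoryState, Runnable appendToLog) {
        groupLock.lock();
        try {
            // All three steps happen under the same lock, so no other request can
            // observe the in-memory state between validation and the log write.
            validate.run();
            updateInMemoryState.run();
            appendToLog.run();
        } finally {
            groupLock.unlock();
        }
    }
}
{code}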



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-15987) Refactor ReplicaManager code for transaction verification

2023-12-08 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan updated KAFKA-15987:
---
Summary: Refactor ReplicaManager code for transaction verification  (was: 
Refactor ReplicaManager code)

> Refactor ReplicaManager code for transaction verification
> -
>
> Key: KAFKA-15987
> URL: https://issues.apache.org/jira/browse/KAFKA-15987
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
>
> I started to do this in KAFKA-15784, but the diff was deemed too large and 
> confusing. I just wanted to file a follow-up ticket that can be referenced in 
> the code for the areas that will be refactored.
>  
> I hope to tackle it immediately after.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (KAFKA-15987) Refactor ReplicaManager code

2023-12-07 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan reassigned KAFKA-15987:
--

Assignee: Justine Olshan

> Refactor ReplicaManager code
> 
>
> Key: KAFKA-15987
> URL: https://issues.apache.org/jira/browse/KAFKA-15987
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Justine Olshan
>Assignee: Justine Olshan
>Priority: Major
>
> I started to do this in KAFKA-15784, but the diff was deemed too large and 
> confusing. I just wanted to file a follow-up ticket that can be referenced in 
> the code for the areas that will be refactored.
>  
> I hope to tackle it immediately after.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15987) Refactor ReplicaManager code

2023-12-07 Thread Justine Olshan (Jira)
Justine Olshan created KAFKA-15987:
--

 Summary: Refactor ReplicaManager code
 Key: KAFKA-15987
 URL: https://issues.apache.org/jira/browse/KAFKA-15987
 Project: Kafka
  Issue Type: Sub-task
Reporter: Justine Olshan


I started to do this in KAFKA-15784, but the diff was deemed too large and 
confusing. I just wanted to file a follow-up ticket that can be referenced in 
the code for the areas that will be refactored.

 

I hope to tackle it immediately after.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15984) Client disconnections can cause hanging transactions on __consumer_offsets

2023-12-06 Thread Justine Olshan (Jira)
Justine Olshan created KAFKA-15984:
--

 Summary: Client disconnections can cause hanging transactions on 
__consumer_offsets
 Key: KAFKA-15984
 URL: https://issues.apache.org/jira/browse/KAFKA-15984
 Project: Kafka
  Issue Type: Task
Reporter: Justine Olshan


When investigating frequent hanging transactions on __consumer_offsets 
partitions, we realized that many of them were caused by the same offset being 
committed in duplicate, with one of the duplicates showing 
`"isDisconnectedClient":true`. 

TxnOffsetCommits do not have sequence numbers and thus are not protected 
against duplicates in the same way idempotent produce requests are. As a 
result, when a client is disconnected (and flushes its requests), we may see 
the duplicate get appended to the log. 

KIP-890 part 1 should protect against this, as the duplicate will not succeed 
verification. KIP-890 part 2 strengthens this further, as duplicates (from 
previous transactions) cannot be added to new transactions if the partition 
is re-added, since the epoch will be bumped. 

Another possible solution is to do duplicate checking on the group coordinator 
side when the request comes in. This solution could be used instead of KIP-890 
part 1 to prevent hanging transactions, but given that part 1 only has one open 
PR remaining, we may not need to do this. However, this can also prevent 
duplicates from being added to a new transaction, something only part 2 will 
protect against.
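
For context, a hedged sketch of the client-side call that produces a TxnOffsetCommit (the broker address, topic, group id, and transactional id below are placeholders, and this is not a reproduction of the disconnect scenario):

{code:java}
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerGroupMetadata;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringSerializer;

public class TxnOffsetCommitSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "txn-id");          // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            Map<TopicPartition, OffsetAndMetadata> offsets =
                    Collections.singletonMap(new TopicPartition("input", 0), new OffsetAndMetadata(42L));
            // This maps to a TxnOffsetCommit request to the group coordinator. Unlike a
            // regular produce request it carries no sequence number, so a disconnect
            // followed by a retry can land the same commit on __consumer_offsets twice.
            producer.sendOffsetsToTransaction(offsets, new ConsumerGroupMetadata("my-group"));
            producer.commitTransaction();
        }
    }
}
{code}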



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15975) Update kafka quickstart guide to no longer list ZK start first

2023-12-05 Thread Justine Olshan (Jira)
Justine Olshan created KAFKA-15975:
--

 Summary: Update kafka quickstart guide to no longer list ZK start 
first
 Key: KAFKA-15975
 URL: https://issues.apache.org/jira/browse/KAFKA-15975
 Project: Kafka
  Issue Type: Task
  Components: docs
Affects Versions: 4.0.0
Reporter: Justine Olshan


Given we are deprecating ZooKeeper, I think we should update our quickstart 
guide to not list the ZooKeeper instructions first.

With 4.0, we may want to remove it entirely.
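
For reference, the KRaft flavour of the quickstart boils down to roughly the following (paths are relative to the Kafka distribution, and the exact property file name may differ between releases):

{noformat}
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties
bin/kafka-server-start.sh config/kraft/server.properties
{noformat}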



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15957) ConsistencyVectorIntegrationTest.shouldHaveSamePositionBoundActiveAndStandBy broken

2023-12-01 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792268#comment-17792268
 ] 

Justine Olshan commented on KAFKA-15957:


[https://github.com/apache/kafka/pull/14895] is fixing this

> ConsistencyVectorIntegrationTest.shouldHaveSamePositionBoundActiveAndStandBy 
> broken
> ---
>
> Key: KAFKA-15957
> URL: https://issues.apache.org/jira/browse/KAFKA-15957
> Project: Kafka
>  Issue Type: Bug
>Reporter: Justine Olshan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15957) ConsistencyVectorIntegrationTest.shouldHaveSamePositionBoundActiveAndStandBy broken

2023-12-01 Thread Justine Olshan (Jira)
Justine Olshan created KAFKA-15957:
--

 Summary: 
ConsistencyVectorIntegrationTest.shouldHaveSamePositionBoundActiveAndStandBy 
broken
 Key: KAFKA-15957
 URL: https://issues.apache.org/jira/browse/KAFKA-15957
 Project: Kafka
  Issue Type: Bug
Reporter: Justine Olshan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-15947) Null pointer on LZ4 compression since Kafka 3.6

2023-11-29 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791332#comment-17791332
 ] 

Justine Olshan edited comment on KAFKA-15947 at 11/29/23 9:41 PM:
--

Hey – is this the same as https://issues.apache.org/jira/browse/KAFKA-15653? 
Are you running with transactions?

The above ticket will be fixed in 3.6.1


was (Author: jolshan):
Hey – is this the same as https://issues.apache.org/jira/browse/KAFKA-15653? 
Are you running with transactions?

> Null pointer on LZ4 compression since Kafka 3.6
> ---
>
> Key: KAFKA-15947
> URL: https://issues.apache.org/jira/browse/KAFKA-15947
> Project: Kafka
>  Issue Type: Bug
>  Components: compression
>Affects Versions: 3.6.0
>Reporter: Ludo
>Priority: Major
>
> I have a Kafka Streams application that has been running well for months using 
> client version {{3.5.1}} (bitnami image: {{bitnami/3.5.1-debian-11-r44}}) with 
> {{compression.type: "lz4"}}.
> I've recently updated my Kafka server to Kafka 3.6 (bitnami image: 
> {{bitnami/kafka:3.6.0-debian-11-r0}}).
>  
> Everything runs well for days, then after some time Kafka Streams crashes and 
> Kafka outputs a lot of NullPointerExceptions on the console: 
>  
> {code:java}
> org.apache.kafka.common.KafkaException: java.lang.NullPointerException: 
> Cannot invoke "java.nio.ByteBuffer.hasArray()" because 
> "this.intermediateBufRef" is null
>   at 
> org.apache.kafka.common.record.CompressionType$4.wrapForInput(CompressionType.java:134)
>   at 
> org.apache.kafka.common.record.DefaultRecordBatch.recordInputStream(DefaultRecordBatch.java:273)
>   at 
> org.apache.kafka.common.record.DefaultRecordBatch.compressedIterator(DefaultRecordBatch.java:277)
>   at 
> org.apache.kafka.common.record.DefaultRecordBatch.skipKeyValueIterator(DefaultRecordBatch.java:352)
>   at 
> org.apache.kafka.storage.internals.log.LogValidator.validateMessagesAndAssignOffsetsCompressed(LogValidator.java:358)
>   at 
> org.apache.kafka.storage.internals.log.LogValidator.validateMessagesAndAssignOffsets(LogValidator.java:165)
>   at kafka.log.UnifiedLog.$anonfun$append$2(UnifiedLog.scala:805)
>   at kafka.log.UnifiedLog.append(UnifiedLog.scala:1845)
>   at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:719)
>   at 
> kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1313)
>   at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1301)
>   at 
> kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:1210)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>   at 
> scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
>   at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
>   at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
>   at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:1198)
>   at 
> kafka.server.ReplicaManager.$anonfun$appendRecords$18$adapted(ReplicaManager.scala:754)
>   at 
> kafka.server.KafkaRequestHandler$.$anonfun$wrap$3(KafkaRequestHandler.scala:73)
>   at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:130)
>   at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by: java.lang.NullPointerException: Cannot invoke 
> "java.nio.ByteBuffer.hasArray()" because "this.intermediateBufRef" is null
>   at 
> org.apache.kafka.common.utils.ChunkedBytesStream.(ChunkedBytesStream.java:89)
>   at 
> org.apache.kafka.common.record.CompressionType$4.wrapForInput(CompressionType.java:132)
>   ... 25 more {code}
> At the same time, Kafka Streams raises this error: 
>  
> {code:java}
> org.apache.kafka.streams.errors.StreamsException: Error encountered sending 
> record to topic kestra_workertaskresult for task 3_6 due 
> to:org.apache.kafka.common.errors.UnknownServerException: The server 
> experienced an unexpected error when processing the request.Written offsets 
> would not be recorded and no more records would be sent since this is a fatal 
> error.at 
> org.apache.kafka.streams.processor.internals.RecordCollectorImpl.recordSendError(RecordCollectorImpl.java:297)at
>  
> org.apache.kafka.streams.processor.internals.RecordCollectorImpl.lambda$send$1(RecordCollectorImpl.java:284)at
>  
> 

[jira] [Commented] (KAFKA-15947) Null pointer on LZ4 compression since Kafka 3.6

2023-11-29 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791332#comment-17791332
 ] 

Justine Olshan commented on KAFKA-15947:


Hey – is this the same as https://issues.apache.org/jira/browse/KAFKA-15653? 
Are you running with transactions?

> Null pointer on LZ4 compression since Kafka 3.6
> ---
>
> Key: KAFKA-15947
> URL: https://issues.apache.org/jira/browse/KAFKA-15947
> Project: Kafka
>  Issue Type: Bug
>  Components: compression
>Affects Versions: 3.6.0
>Reporter: Ludo
>Priority: Major
>
> I have a Kafka Streams application that has been running well for months using 
> client version {{3.5.1}} (bitnami image: {{bitnami/3.5.1-debian-11-r44}}) with 
> {{compression.type: "lz4"}}.
> I've recently updated my Kafka server to Kafka 3.6 (bitnami image: 
> {{bitnami/kafka:3.6.0-debian-11-r0}}).
>  
> Everything runs well for days, then after some time Kafka Streams crashes and 
> Kafka outputs a lot of NullPointerExceptions on the console: 
>  
> {code:java}
> org.apache.kafka.common.KafkaException: java.lang.NullPointerException: 
> Cannot invoke "java.nio.ByteBuffer.hasArray()" because 
> "this.intermediateBufRef" is null
>   at 
> org.apache.kafka.common.record.CompressionType$4.wrapForInput(CompressionType.java:134)
>   at 
> org.apache.kafka.common.record.DefaultRecordBatch.recordInputStream(DefaultRecordBatch.java:273)
>   at 
> org.apache.kafka.common.record.DefaultRecordBatch.compressedIterator(DefaultRecordBatch.java:277)
>   at 
> org.apache.kafka.common.record.DefaultRecordBatch.skipKeyValueIterator(DefaultRecordBatch.java:352)
>   at 
> org.apache.kafka.storage.internals.log.LogValidator.validateMessagesAndAssignOffsetsCompressed(LogValidator.java:358)
>   at 
> org.apache.kafka.storage.internals.log.LogValidator.validateMessagesAndAssignOffsets(LogValidator.java:165)
>   at kafka.log.UnifiedLog.$anonfun$append$2(UnifiedLog.scala:805)
>   at kafka.log.UnifiedLog.append(UnifiedLog.scala:1845)
>   at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:719)
>   at 
> kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1313)
>   at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1301)
>   at 
> kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:1210)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>   at 
> scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
>   at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
>   at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
>   at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:1198)
>   at 
> kafka.server.ReplicaManager.$anonfun$appendRecords$18$adapted(ReplicaManager.scala:754)
>   at 
> kafka.server.KafkaRequestHandler$.$anonfun$wrap$3(KafkaRequestHandler.scala:73)
>   at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:130)
>   at java.base/java.lang.Thread.run(Thread.java:833)
> Caused by: java.lang.NullPointerException: Cannot invoke 
> "java.nio.ByteBuffer.hasArray()" because "this.intermediateBufRef" is null
>   at 
> org.apache.kafka.common.utils.ChunkedBytesStream.(ChunkedBytesStream.java:89)
>   at 
> org.apache.kafka.common.record.CompressionType$4.wrapForInput(CompressionType.java:132)
>   ... 25 more {code}
> At the same time, Kafka Streams raises this error: 
>  
> {code:java}
> org.apache.kafka.streams.errors.StreamsException: Error encountered sending 
> record to topic kestra_workertaskresult for task 3_6 due 
> to:org.apache.kafka.common.errors.UnknownServerException: The server 
> experienced an unexpected error when processing the request.Written offsets 
> would not be recorded and no more records would be sent since this is a fatal 
> error.at 
> org.apache.kafka.streams.processor.internals.RecordCollectorImpl.recordSendError(RecordCollectorImpl.java:297)at
>  
> org.apache.kafka.streams.processor.internals.RecordCollectorImpl.lambda$send$1(RecordCollectorImpl.java:284)at
>  
> org.apache.kafka.clients.producer.KafkaProducer$AppendCallbacks.onCompletion(KafkaProducer.java:1505)at
>  
> org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:273)at
>  
> 

[jira] [Commented] (KAFKA-15653) NPE in ChunkedByteStream

2023-11-15 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17786338#comment-17786338
 ] 

Justine Olshan commented on KAFKA-15653:


Yes, [~h...@pinterest.com], this is resolved. KAFKA-15784, which was discovered 
while working on this, is still open, but I'm working on it.

> NPE in ChunkedByteStream
> 
>
> Key: KAFKA-15653
> URL: https://issues.apache.org/jira/browse/KAFKA-15653
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 3.6.0
> Environment: Docker container on a Linux laptop, using the latest 
> release.
>Reporter: Travis Bischel
>Assignee: Justine Olshan
>Priority: Major
> Fix For: 3.7.0, 3.6.1
>
> Attachments: repro.sh
>
>
> When looping franz-go integration tests, I received an UNKNOWN_SERVER_ERROR 
> from producing. The broker logs for the failing request:
>  
> {noformat}
> [2023-10-19 22:29:58,160] ERROR [ReplicaManager broker=2] Error processing 
> append operation on partition 
> 2fa8995d8002fbfe68a96d783f26aa2c5efc15368bf44ed8f2ab7e24b41b9879-24 
> (kafka.server.ReplicaManager)
> java.lang.NullPointerException
>   at 
> org.apache.kafka.common.utils.ChunkedBytesStream.(ChunkedBytesStream.java:89)
>   at 
> org.apache.kafka.common.record.CompressionType$3.wrapForInput(CompressionType.java:105)
>   at 
> org.apache.kafka.common.record.DefaultRecordBatch.recordInputStream(DefaultRecordBatch.java:273)
>   at 
> org.apache.kafka.common.record.DefaultRecordBatch.compressedIterator(DefaultRecordBatch.java:277)
>   at 
> org.apache.kafka.common.record.DefaultRecordBatch.skipKeyValueIterator(DefaultRecordBatch.java:352)
>   at 
> org.apache.kafka.storage.internals.log.LogValidator.validateMessagesAndAssignOffsetsCompressed(LogValidator.java:358)
>   at 
> org.apache.kafka.storage.internals.log.LogValidator.validateMessagesAndAssignOffsets(LogValidator.java:165)
>   at kafka.log.UnifiedLog.append(UnifiedLog.scala:805)
>   at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:719)
>   at 
> kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1313)
>   at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1301)
>   at 
> kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:1210)
>   at 
> scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)
>   at 
> scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)
>   at scala.collection.mutable.HashMap.map(HashMap.scala:35)
>   at 
> kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:1198)
>   at kafka.server.ReplicaManager.appendEntries$1(ReplicaManager.scala:754)
>   at 
> kafka.server.ReplicaManager.$anonfun$appendRecords$18(ReplicaManager.scala:874)
>   at 
> kafka.server.ReplicaManager.$anonfun$appendRecords$18$adapted(ReplicaManager.scala:874)
>   at 
> kafka.server.KafkaRequestHandler$.$anonfun$wrap$3(KafkaRequestHandler.scala:73)
>   at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:130)
>   at java.base/java.lang.Thread.run(Unknown Source)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15653) NPE in ChunkedByteStream

2023-11-15 Thread Justine Olshan (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Justine Olshan resolved KAFKA-15653.

Fix Version/s: 3.7.0
   3.6.1
   Resolution: Fixed

> NPE in ChunkedByteStream
> 
>
> Key: KAFKA-15653
> URL: https://issues.apache.org/jira/browse/KAFKA-15653
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 3.6.0
> Environment: Docker container on a Linux laptop, using the latest 
> release.
>Reporter: Travis Bischel
>Assignee: Justine Olshan
>Priority: Major
> Fix For: 3.7.0, 3.6.1
>
> Attachments: repro.sh
>
>
> When looping franz-go integration tests, I received an UNKNOWN_SERVER_ERROR 
> from producing. The broker logs for the failing request:
>  
> {noformat}
> [2023-10-19 22:29:58,160] ERROR [ReplicaManager broker=2] Error processing 
> append operation on partition 
> 2fa8995d8002fbfe68a96d783f26aa2c5efc15368bf44ed8f2ab7e24b41b9879-24 
> (kafka.server.ReplicaManager)
> java.lang.NullPointerException
>   at 
> org.apache.kafka.common.utils.ChunkedBytesStream.(ChunkedBytesStream.java:89)
>   at 
> org.apache.kafka.common.record.CompressionType$3.wrapForInput(CompressionType.java:105)
>   at 
> org.apache.kafka.common.record.DefaultRecordBatch.recordInputStream(DefaultRecordBatch.java:273)
>   at 
> org.apache.kafka.common.record.DefaultRecordBatch.compressedIterator(DefaultRecordBatch.java:277)
>   at 
> org.apache.kafka.common.record.DefaultRecordBatch.skipKeyValueIterator(DefaultRecordBatch.java:352)
>   at 
> org.apache.kafka.storage.internals.log.LogValidator.validateMessagesAndAssignOffsetsCompressed(LogValidator.java:358)
>   at 
> org.apache.kafka.storage.internals.log.LogValidator.validateMessagesAndAssignOffsets(LogValidator.java:165)
>   at kafka.log.UnifiedLog.append(UnifiedLog.scala:805)
>   at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:719)
>   at 
> kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1313)
>   at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1301)
>   at 
> kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:1210)
>   at 
> scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)
>   at 
> scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)
>   at scala.collection.mutable.HashMap.map(HashMap.scala:35)
>   at 
> kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:1198)
>   at kafka.server.ReplicaManager.appendEntries$1(ReplicaManager.scala:754)
>   at 
> kafka.server.ReplicaManager.$anonfun$appendRecords$18(ReplicaManager.scala:874)
>   at 
> kafka.server.ReplicaManager.$anonfun$appendRecords$18$adapted(ReplicaManager.scala:874)
>   at 
> kafka.server.KafkaRequestHandler$.$anonfun$wrap$3(KafkaRequestHandler.scala:73)
>   at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:130)
>   at java.base/java.lang.Thread.run(Unknown Source)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14806) Add connection timeout in PlaintextSender used by SelectorTests

2023-11-09 Thread Justine Olshan (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784619#comment-17784619
 ] 

Justine Olshan commented on KAFKA-14806:


Had the same sort of failures on my run recently: 
[https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-14712/6/] 
Funny how the SocketServerTests all fail too.

> Add connection timeout in PlaintextSender used by SelectorTests
> ---
>
> Key: KAFKA-14806
> URL: https://issues.apache.org/jira/browse/KAFKA-14806
> Project: Kafka
>  Issue Type: Test
>Reporter: Alexandre Dupriez
>Assignee: Alexandre Dupriez
>Priority: Minor
>
> Tests in {{SelectorTest}} can fail due to spurious connection timeouts. One 
> example can be found in [this 
> build|https://github.com/apache/kafka/pull/13378/checks?check_run_id=11970595528]
>  where the client connection the {{PlaintextSender}} tried to open could not 
> be established before the test timed out.
> It may be worth enforcing a connection timeout and retries if this can add to 
> the selector tests' resiliency. Note that {{PlaintextSender}} is only used by 
> {{SelectorTest}}, so the scope of the change would remain local.
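
As a rough illustration of what enforcing a connect timeout with retries could look like on a plain socket (a standalone sketch, not the actual PlaintextSender code):

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ConnectWithTimeoutSketch {
    // Hypothetical helper: attempt a plaintext connection up to (retries + 1) times,
    // each attempt bounded by timeoutMs, and rethrow the last failure if all attempts fail.
    static Socket connect(String host, int port, int timeoutMs, int retries) throws IOException {
        IOException last = null;
        for (int attempt = 0; attempt <= retries; attempt++) {
            Socket socket = new Socket();
            try {
                socket.connect(new InetSocketAddress(host, port), timeoutMs);
                return socket;
            } catch (IOException e) {
                socket.close();
                last = e;
            }
        }
        throw last;
    }
}
{code}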



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-15798) Flaky Test NamedTopologyIntegrationTest.shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology()

2023-11-07 Thread Justine Olshan (Jira)
Justine Olshan created KAFKA-15798:
--

 Summary: Flaky Test 
NamedTopologyIntegrationTest.shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology()
 Key: KAFKA-15798
 URL: https://issues.apache.org/jira/browse/KAFKA-15798
 Project: Kafka
  Issue Type: Bug
Reporter: Justine Olshan


I saw a few examples recently. Two have the same error, but the third is different.

[https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-14629/22/testReport/junit/org.apache.kafka.streams.integration/NamedTopologyIntegrationTest/Build___JDK_8_and_Scala_2_12___shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology___2/]

[https://ci-builds.apache.org/job/Kafka/job/kafka/job/trunk/2365/testReport/junit/org.apache.kafka.streams.integration/NamedTopologyIntegrationTest/Build___JDK_21_and_Scala_2_13___shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology__/]
 

The failure looks like:


{code:java}
java.lang.AssertionError: Did not receive all 5 records from topic 
output-stream-1 within 6 ms, currently accumulated data is [] Expected: is 
a value equal to or greater than <5> but: <0> was less than <5>{code}

The other failure was
[https://ci-builds.apache.org/job/Kafka/job/kafka/job/trunk/2365/testReport/junit/org.apache.kafka.streams.integration/NamedTopologyIntegrationTest/Build___JDK_8_and_Scala_2_12___shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology__/]


{code:java}
java.lang.AssertionError: Expected: <[0, 1]> but: was <[0]>{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

