[jira] [Commented] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6
[ https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17846763#comment-17846763 ]

Justine Olshan commented on KAFKA-16692:

Hey [~akaltsikis], yeah, those are the release notes for 3.6. We can probably edit the kafka-site repo to get the change to show up in real time, but updating via the kafka repo requires waiting for the next release for the site to update.

Key: KAFKA-16692
URL: https://issues.apache.org/jira/browse/KAFKA-16692
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 3.7.0, 3.6.1, 3.8
Reporter: Johnson Okorie
Assignee: Justine Olshan
Priority: Major

We have a Kafka cluster running version 3.5.2 that we are upgrading to 3.6.1. This cluster has many clients with exactly-once semantics enabled, and hence creating transactions. As we replaced brokers with the new binaries, we observed many clients in the cluster experiencing the following error:

{code:java}
2024-05-07T09:08:10.039Z "tid": "" -- [Producer clientId=, transactionalId=] Got error produce response with correlation id 6402937 on topic-partition , retrying (2147483512 attempts left). Error: NETWORK_EXCEPTION. Error Message: The server disconnected before a response was received.
{code}

On inspecting the brokers, we saw the following errors on brokers still running Kafka 3.5.2:

{code:java}
message: Closing socket for because of error
exception_exception_class: org.apache.kafka.common.errors.InvalidRequestException
exception_exception_message: Received request api key ADD_PARTITIONS_TO_TXN with version 4 which is not enabled
exception_stacktrace: org.apache.kafka.common.errors.InvalidRequestException: Received request api key ADD_PARTITIONS_TO_TXN with version 4 which is not enabled
{code}

On the new brokers running 3.6.1 we saw the following errors:

{code:java}
[AddPartitionsToTxnSenderThread-1055]: AddPartitionsToTxnRequest failed for node 1043 with a network exception.
{code}

I can also see this:

{code:java}
[AddPartitionsToTxnManager broker=1055] Cancelled in-flight ADD_PARTITIONS_TO_TXN request with correlation id 21120 due to node 1043 being disconnected (elapsed time since creation: 11ms, elapsed time since send: 4ms, request timeout: 3ms)
{code}

We started investigating this issue, and digging through the changes in 3.6 we came across changes introduced as part of KAFKA-14402 that we thought might lead to this behaviour. First, we could see that _transaction.partition.verification.enable_ is enabled by default and enables a new code path that culminates in us sending version 4 ADD_PARTITIONS_TO_TXN requests to other brokers, generated [here|https://github.com/apache/kafka/blob/29f3260a9c07e654a28620aeb93a778622a5233d/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L269].

From a [discussion|https://lists.apache.org/thread/4895wrd1z92kjb708zck4s1f62xq6r8x] on the mailing list, [~jolshan] pointed out that this scenario shouldn't be possible, as the following code paths should prevent version 4 ADD_PARTITIONS_TO_TXN requests from being sent to other brokers:

[https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/clients/src/main/java/org/apache/kafka/clients/NodeApiVersions.java#L130]
[https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L195]

However, these requests are still sent to other brokers in our environment. On further inspection of the code, I am wondering if the following code path could lead to this issue:

[https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L500]

In this scenario, we don't have any _NodeApiVersions_ available for the specified nodeId, potentially skipping the _latestUsableVersion_ check. I am wondering if it is possible that, because _discoverBrokerVersions_ is set to _false_ for the network client of the _AddPartitionsToTxnManager_, it skips fetching _NodeApiVersions_? I can see that we create the network client here:

[https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/core/src/main/scala/kafka/server/KafkaServer.scala#L641]

The _NetworkUtils.buildNetworkClient_ method seems to create a network client that has _discoverBrokerVersions_ set to _false_.

I was hoping I could get some assistance debugging this issue. Happy to provide any additional information needed.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
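The hypothesis above can be made concrete with a small stand-alone simulation. This is illustrative only, not actual Kafka code: the class, the `versionFor` method, and the fallback behaviour are assumptions modelling the reported scenario, in which a node with no cached `NodeApiVersions` (because version discovery is disabled) is sent the sender's latest request version without the `latestUsableVersion` gate ever running.

```java
import java.util.HashMap;
import java.util.Map;

public class VersionGateSketch {
    // Latest ADD_PARTITIONS_TO_TXN version a 3.6 broker knows locally.
    static final short LOCAL_LATEST_ADD_PARTITIONS_TO_TXN = 4;

    // nodeId -> highest ADD_PARTITIONS_TO_TXN version that node advertised.
    // Empty for a node whose ApiVersions were never fetched.
    static final Map<Integer, Short> nodeApiVersions = new HashMap<>();

    static short versionFor(int nodeId) {
        Short remoteMax = nodeApiVersions.get(nodeId);
        if (remoteMax == null) {
            // Hypothesized gap: with discoverBrokerVersions=false there is no
            // cached NodeApiVersions, so the usable-version check is skipped
            // and the latest local version goes out on the wire.
            return LOCAL_LATEST_ADD_PARTITIONS_TO_TXN;
        }
        // Normal path: negotiate down to what both sides support.
        return (short) Math.min(LOCAL_LATEST_ADD_PARTITIONS_TO_TXN, remoteMax);
    }

    public static void main(String[] args) {
        nodeApiVersions.put(1001, (short) 3); // a 3.5.2 broker we do have versions for
        System.out.println(versionFor(1001)); // gated down to 3
        System.out.println(versionFor(1043)); // unknown node: ungated version 4
    }
}
```

In this model, the unknown node receives version 4, which a 3.5.2 broker rejects with exactly the InvalidRequestException quoted in the description.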
[jira] [Updated] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6
[ https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justine Olshan updated KAFKA-16692:
---
Affects Version/s: 3.8
[jira] [Updated] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6
[ https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justine Olshan updated KAFKA-16692:
---
Affects Version/s: 3.7.0
[jira] [Commented] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6
[ https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17846742#comment-17846742 ]

Justine Olshan commented on KAFKA-16692:

[~akaltsikis] Hmmm... I've included information in release notes about bugs like this before, but I'm not sure there is a place to update between releases. Good news, though: the PR is almost ready. Just running a final batch of tests.
[jira] [Commented] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6
[ https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17846591#comment-17846591 ]

Justine Olshan commented on KAFKA-16692:

The delay in the fix comes from the fact that reproducing the issue in a system test is hard: while I can get the error to appear, it is hard to consistently get a test to fail because of it. I assure you I'm making progress and hope to have a fix by the end of the week :)
[jira] [Commented] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6
[ https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17846590#comment-17846590 ]

Justine Olshan commented on KAFKA-16692:

Hi [~akaltsikis]. Yes, the bug is due to the verification requests sent during the upgrade. If you disable the verification, you won't hit the bug. However, the feature was designed to allow upgrades, and there seems to be a small bug in the logic that gates these requests during the upgrade. I'm working on fixing that. In the meantime, disabling the feature during the upgrade is a workaround.
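The workaround described above — disabling verification for the duration of the rolling upgrade — would look something like the following broker override. This is a sketch of the setting named in the issue description, not upgrade guidance from the thread itself; verify it against your deployment before relying on it.

```properties
# Temporary workaround while rolling brokers from 3.5.x to 3.6.x:
# disable transaction partition verification, then re-enable it
# once every broker in the cluster runs the new binaries.
transaction.partition.verification.enable=false
```

Whether this is set statically in server.properties or applied as a dynamic broker config depends on the deployment.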
[jira] [Commented] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6
[ https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17846032#comment-17846032 ]

Justine Olshan commented on KAFKA-16692:

Hey [~j-okorie], I'm still looking at this. I have a way to reproduce the upgrade in the test, but the test still passes, likely because there is enough time to write the messages. I will tweak it to see if I can get it to fail without the fix and pass with the fix.
[jira] [Resolved] (KAFKA-16513) Allow WriteTxnMarkers API with Alter Cluster Permission
[ https://issues.apache.org/jira/browse/KAFKA-16513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-16513. Resolution: Fixed > Allow WriteTxnMarkers API with Alter Cluster Permission > --- > > Key: KAFKA-16513 > URL: https://issues.apache.org/jira/browse/KAFKA-16513 > Project: Kafka > Issue Type: Improvement > Components: admin >Reporter: Nikhil Ramakrishnan >Assignee: Siddharth Yagnik >Priority: Minor > Labels: KIP-1037 > Fix For: 3.8.0 > > > We should allow WriteTxnMarkers API with Alter Cluster Permission because it > can be invoked externally by a Kafka AdminClient. Such usage is more aligned > with the Alter permission on the Cluster resource, which includes other > administrative actions invoked from the Kafka AdminClient. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16699) Have Streams treat InvalidPidMappingException like a ProducerFencedException
[ https://issues.apache.org/jira/browse/KAFKA-16699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845497#comment-17845497 ] Justine Olshan commented on KAFKA-16699: Yay yay! I'm happy this is getting fixed :) > Have Streams treat InvalidPidMappingException like a ProducerFencedException > > > Key: KAFKA-16699 > URL: https://issues.apache.org/jira/browse/KAFKA-16699 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Walker Carlson >Assignee: Walker Carlson >Priority: Major > > KStreams is able to handle the ProducerFenced (among other errors) cleanly. > It does this by closing the task dirty and triggering a rebalance amongst the > worker threads to rejoin the group. The producer is also recreated. Due to > how streams works (writing to and reading from various topics), the > application is able to figure out the last thing the fenced producer > completed and continue from there. > KStreams EOS V2 also trusts that any open transaction (including those whose > producer is fenced) will be aborted by the server. This is a key factor in > how it is able to operate. In EOS V1, the new InitProducerId fences and > aborts the previous transaction. In either case, we are able to reason about > the last valid state from the fenced producer and how to proceed. > h2. InvalidPidMappingException ≈ ProducerFenced > I argue that InvalidPidMappingException can be handled in the same way. Let > me explain why. > There are two cases we see this error: > # > > {{txnManager.getTransactionState(transactionalId).flatMap { case None => > Left(Errors.INVALID_PRODUCER_ID_MAPPING)}} > # > > {{if (txnMetadata.producerId != producerId) > Left(Errors.INVALID_PRODUCER_ID_MAPPING)}} > h3. Case 1 > We are missing a value in the transactional state map for the transactional > ID. 
Under normal operations, this is only possible when the transactional ID > expires via the mechanism described above after > {{transactional.id.expiration.ms}} of inactivity. In this case, there is no > state that needs to be reconciled. It is safe to just rebalance and rejoin > the group with a new producer. We probably don’t even need to close the task > dirty, but it doesn’t hurt to do so. > h3. Case 2 > This is a bit more interesting. It says that we have transactional state, but > the producer ID in the request does not match the producer ID associated with > the transactional ID on the broker. How can this happen? > It is possible that a new producer instance B with the same transactional ID > was created after the transactional state expired for instance A. Given there > is no state on the server when B joins, it will get a totally new producer > ID. If the original producer A comes back, it will have state for this > transactional ID but the wrong producer ID. > In this case, the old producer ID is fenced; it’s just that the normal epoch-based > fencing logic doesn’t apply. We can treat it the same, however. > h2. Summary > As described in the cases above, any time we encounter the InvalidPidMapping > during normal operation, the previous producer was either finished with its > operations or was fenced. Thus, it is safe to close the task dirty and rebalance + > rejoin the group just as we do with the ProducerFenced exception. -- This message was sent by Atlassian Jira (v8.20.10#820010)
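The two cases above collapse into one recovery rule. Here is a minimal, hypothetical Java sketch of that rule; the exception classes below are local stand-ins for the ones in org.apache.kafka.common.errors, and the classifier is illustrative, not the actual Streams error handler:

```java
// Hedged sketch of the ticket's proposal: both exceptions mean the old
// producer can never commit again, so both can take the same close-dirty,
// rebalance-and-rejoin recovery path. Exception classes are stand-ins.
public class TaskErrorSketch {
    public static class ProducerFencedException extends RuntimeException {}
    public static class InvalidPidMappingException extends RuntimeException {}

    public enum Action { CLOSE_DIRTY_AND_REJOIN, FAIL }

    public static Action classify(RuntimeException e) {
        if (e instanceof ProducerFencedException || e instanceof InvalidPidMappingException) {
            // Previous producer either finished or was fenced: safe to rejoin.
            return Action.CLOSE_DIRTY_AND_REJOIN;
        }
        return Action.FAIL;
    }

    public static void main(String[] args) {
        System.out.println(classify(new InvalidPidMappingException())); // CLOSE_DIRTY_AND_REJOIN
        System.out.println(classify(new RuntimeException("other")));    // FAIL
    }
}
```

The point of the sketch is only that InvalidPidMappingException joins the branch ProducerFencedException already takes; everything else about the recovery path stays as-is.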
[jira] [Commented] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when upgrading from kafka 3.5 to 3.6
[ https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845004#comment-17845004 ] Justine Olshan commented on KAFKA-16692: I'm working on a test to recreate the issue, and then seeing if the proposed fix helps. > InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not > enabled when upgrading from kafka 3.5 to 3.6 > > > Key: KAFKA-16692 > URL: https://issues.apache.org/jira/browse/KAFKA-16692 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 3.6.1 >Reporter: Johnson Okorie >Assignee: Justine Olshan >Priority: Major > > We have a Kafka cluster running on version 3.5.2 that we are upgrading to > 3.6.1. This cluster has a lot of clients with exactly-once semantics enabled > and hence creating transactions. As we replaced brokers with the new > binaries, we observed lots of clients in the cluster experiencing the > following error: > {code:java} > 2024-05-07T09:08:10.039Z "tid": "" -- [Producer clientId=, > transactionalId=] Got error produce response with > correlation id 6402937 on topic-partition , retrying > (2147483512 attempts left). Error: NETWORK_EXCEPTION. 
Error Message: The > server disconnected before a response was received.{code} > On inspecting the broker, we saw the following errors on brokers still > running Kafka version 3.5.2: > > {code:java} > message: > Closing socket for because of error > exception_exception_class: > org.apache.kafka.common.errors.InvalidRequestException > exception_exception_message: > Received request api key ADD_PARTITIONS_TO_TXN with version 4 which is not > enabled > exception_stacktrace: > org.apache.kafka.common.errors.InvalidRequestException: Received request api > key ADD_PARTITIONS_TO_TXN with version 4 which is not enabled > {code} > On the new brokers running 3.6.1 we saw the following errors: > > {code:java} > [AddPartitionsToTxnSenderThread-1055]: AddPartitionsToTxnRequest failed for > node 1043 with a network exception.{code} > > I can also see this: > {code:java} > [AddPartitionsToTxnManager broker=1055]Cancelled in-flight > ADD_PARTITIONS_TO_TXN request with correlation id 21120 due to node 1043 > being disconnected (elapsed time since creation: 11ms, elapsed time since > send: 4ms, request timeout: 3ms){code} > We started investigating this issue and, digging through the changes in 3.6, > we came across some changes introduced as part of KAFKA-14402 that we thought > might lead to this behaviour. > First, we could see that _transaction.partition.verification.enable_ is > enabled by default and enables a new code path that culminates in sending > version 4 ADD_PARTITIONS_TO_TXN requests to other brokers that are generated > [here|https://github.com/apache/kafka/blob/29f3260a9c07e654a28620aeb93a778622a5233d/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L269]. 
> From a > [discussion|https://lists.apache.org/thread/4895wrd1z92kjb708zck4s1f62xq6r8x] > on the mailing list, [~jolshan] pointed out that this scenario shouldn't be > possible as the following code paths should prevent version 4 > ADD_PARTITIONS_TO_TXN requests being sent to other brokers: > [https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/clients/src/main/java/org/apache/kafka/clients/NodeApiVersions.java#L130] > > [https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L195] > However, these requests are still sent to other brokers in our environment. > On further inspection of the code, I am wondering if the following code path > could lead to this issue: > [https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L500] > In this scenario, we don't have any _NodeApiVersions_ available for the > specified nodeId and potentially skip the _latestUsableVersion_ check. I > am wondering if it is possible that because > _discoverBrokerVersions_ is set to _false_ for the network client of the > {_}AddPartitionsToTxnManager{_}, it skips fetching {_}NodeApiVersions{_}? I > can see that we create the network client here: > [https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/core/src/main/scala/kafka/server/KafkaServer.scala#L641] > The _NetworkUtils.buildNetworkClient_ method seems to create a network client > that has _discoverBrokerVersions_ set to {_}false{_}. > I was hoping I could get some assistance debugging this issue. Happy to > provide any additional information needed. > > > -- This
[jira] [Commented] (KAFKA-16264) Expose `producer.id.expiration.check.interval.ms` as dynamic broker configuration
[ https://issues.apache.org/jira/browse/KAFKA-16264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844721#comment-17844721 ] Justine Olshan commented on KAFKA-16264: Thanks. I will take a look. It just occurred to me that such a change may require a small KIP. [https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals] Let me double check and get back to you. > Expose `producer.id.expiration.check.interval.ms` as dynamic broker > configuration > - > > Key: KAFKA-16264 > URL: https://issues.apache.org/jira/browse/KAFKA-16264 > Project: Kafka > Issue Type: Improvement >Reporter: Jorge Esteban Quilcate Otoya >Assignee: Jorge Esteban Quilcate Otoya >Priority: Major > > Dealing with a scenario where too many producer ids lead to issues (e.g. high > cpu utilization, see KAFKA-16229) puts operators in need of flushing producer ids > more promptly than usual. > Currently, only the expiration timeout `producer.id.expiration.ms` is exposed > as dynamic config. This is helpful (e.g. by reducing the timeout, fewer > producers would eventually be kept in memory), but not enough if the > evaluation interval is not sufficiently short to flush producer ids before > they become an issue. Only by tuning both can the issue be worked around. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when
[ https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan reassigned KAFKA-16692: -- Assignee: Justine Olshan > InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not > enabled when > > > Key: KAFKA-16692 > URL: https://issues.apache.org/jira/browse/KAFKA-16692 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 3.6.1 >Reporter: Johnson Okorie >Assignee: Justine Olshan >Priority: Major > > We have a Kafka cluster running on version 3.5.2 that we are upgrading to > 3.6.1. This cluster has a lot of clients with exactly-once semantics enabled > and hence creating transactions. As we replaced brokers with the new > binaries, we observed lots of clients in the cluster experiencing the > following error: > {code:java} > 2024-05-07T09:08:10.039Z "tid": "" -- [Producer clientId=, > transactionalId=] Got error produce response with > correlation id 6402937 on topic-partition , retrying > (2147483512 attempts left). Error: NETWORK_EXCEPTION. 
Error Message: The > server disconnected before a response was received.{code} > On inspecting the broker, we saw the following errors on brokers still > running Kafka version 3.5.2: > > {code:java} > message: > Closing socket for because of error > exception_exception_class: > org.apache.kafka.common.errors.InvalidRequestException > exception_exception_message: > Received request api key ADD_PARTITIONS_TO_TXN with version 4 which is not > enabled > exception_stacktrace: > org.apache.kafka.common.errors.InvalidRequestException: Received request api > key ADD_PARTITIONS_TO_TXN with version 4 which is not enabled > {code} > On the new brokers running 3.6.1 we saw the following errors: > > {code:java} > [AddPartitionsToTxnSenderThread-1055]: AddPartitionsToTxnRequest failed for > node 1043 with a network exception.{code} > > I can also see this: > {code:java} > [AddPartitionsToTxnManager broker=1055]Cancelled in-flight > ADD_PARTITIONS_TO_TXN request with correlation id 21120 due to node 1043 > being disconnected (elapsed time since creation: 11ms, elapsed time since > send: 4ms, request timeout: 3ms){code} > We started investigating this issue and, digging through the changes in 3.6, > we came across some changes introduced as part of KAFKA-14402 that we thought > might lead to this behaviour. > First, we could see that _transaction.partition.verification.enable_ is > enabled by default and enables a new code path that culminates in sending > version 4 ADD_PARTITIONS_TO_TXN requests to other brokers that are generated > [here|https://github.com/apache/kafka/blob/29f3260a9c07e654a28620aeb93a778622a5233d/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L269]. 
> From a > [discussion|https://lists.apache.org/thread/4895wrd1z92kjb708zck4s1f62xq6r8x] > on the mailing list, [~jolshan] pointed out that this scenario shouldn't be > possible as the following code paths should prevent version 4 > ADD_PARTITIONS_TO_TXN requests being sent to other brokers: > [https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/clients/src/main/java/org/apache/kafka/clients/NodeApiVersions.java#L130] > > [https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L195] > However, these requests are still sent to other brokers in > our environment. > On further inspection of the code, I am wondering if the following code path > could lead to this issue: > [https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L500] > In this scenario, we don't have any _NodeApiVersions_ available for the > specified nodeId and potentially skip the _latestUsableVersion_ check. I am wondering if it is possible that because > _discoverBrokerVersions_ is set to false for the network client of the > AddPartitionsToTxnManager, it skips fetching ApiVersions? I can see that > we create the network client here: > [https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/core/src/main/scala/kafka/server/KafkaServer.scala#L641] > The _NetworkUtils.buildNetworkClient_ method seems to create a network client that > has _discoverBrokerVersions_ set to false. > I was hoping I could get some assistance debugging this issue. Happy to > provide any additional information needed. > > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
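The suspected gap described above can be illustrated with a small, hedged sketch. This is not the actual NetworkClient logic; the map and method names are invented for illustration, standing in for the NodeApiVersions / latestUsableVersion machinery:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the suspected gap: if no ApiVersions data was
// ever fetched for a node (as when discoverBrokerVersions is false), there is
// nothing to clamp against, and the client's newest version goes out as-is.
public class VersionNegotiationSketch {

    // remoteMaxByNode: highest ADD_PARTITIONS_TO_TXN version each node
    // advertises; a missing entry means we never got an ApiVersions response.
    public static short chooseRequestVersion(Map<Integer, Short> remoteMaxByNode,
                                             int nodeId, short clientLatest) {
        Short remoteMax = remoteMaxByNode.get(nodeId);
        if (remoteMax == null) {
            // No NodeApiVersions for this node: the latestUsableVersion-style
            // clamp is skipped and we fall through with the client's latest.
            return clientLatest;
        }
        return (short) Math.min(clientLatest, remoteMax);
    }

    public static void main(String[] args) {
        Map<Integer, Short> known = new HashMap<>();
        known.put(1043, (short) 3); // a 3.5 broker supports up to v3
        // With version info present, a v4 request is clamped down to v3.
        System.out.println(chooseRequestVersion(known, 1043, (short) 4));
        // With no entry at all, v4 is sent to a broker that rejects it.
        System.out.println(chooseRequestVersion(new HashMap<>(), 1043, (short) 4));
    }
}
```

Under this reading, the bug is not in the clamp itself but in the path where the clamp never runs because the per-node version table was never populated.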
[jira] [Commented] (KAFKA-16692) InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not enabled when
[ https://issues.apache.org/jira/browse/KAFKA-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844708#comment-17844708 ] Justine Olshan commented on KAFKA-16692: Hi [~j-okorie]! Thanks for filing the ticket and the additional debugging. I will look at this today. > InvalidRequestException: ADD_PARTITIONS_TO_TXN with version 4 which is not > enabled when > > > Key: KAFKA-16692 > URL: https://issues.apache.org/jira/browse/KAFKA-16692 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 3.6.1 >Reporter: Johnson Okorie >Priority: Major > > We have a Kafka cluster running on version 3.5.2 that we are upgrading to > 3.6.1. This cluster has a lot of clients with exactly-once semantics enabled > and hence creating transactions. As we replaced brokers with the new > binaries, we observed lots of clients in the cluster experiencing the > following error: > {code:java} > 2024-05-07T09:08:10.039Z "tid": "" -- [Producer clientId=, > transactionalId=] Got error produce response with > correlation id 6402937 on topic-partition , retrying > (2147483512 attempts left). Error: NETWORK_EXCEPTION. 
Error Message: The > server disconnected before a response was received.{code} > On inspecting the broker, we saw the following errors on brokers still > running Kafka version 3.5.2: > > {code:java} > message: > Closing socket for because of error > exception_exception_class: > org.apache.kafka.common.errors.InvalidRequestException > exception_exception_message: > Received request api key ADD_PARTITIONS_TO_TXN with version 4 which is not > enabled > exception_stacktrace: > org.apache.kafka.common.errors.InvalidRequestException: Received request api > key ADD_PARTITIONS_TO_TXN with version 4 which is not enabled > {code} > On the new brokers running 3.6.1 we saw the following errors: > > {code:java} > [AddPartitionsToTxnSenderThread-1055]: AddPartitionsToTxnRequest failed for > node 1043 with a network exception.{code} > > I can also see this: > {code:java} > [AddPartitionsToTxnManager broker=1055]Cancelled in-flight > ADD_PARTITIONS_TO_TXN request with correlation id 21120 due to node 1043 > being disconnected (elapsed time since creation: 11ms, elapsed time since > send: 4ms, request timeout: 3ms){code} > We started investigating this issue and, digging through the changes in 3.6, > we came across some changes introduced as part of KAFKA-14402 that we thought > might lead to this behaviour. > First, we could see that _transaction.partition.verification.enable_ is > enabled by default and enables a new code path that culminates in sending > version 4 ADD_PARTITIONS_TO_TXN requests to other brokers that are generated > [here|https://github.com/apache/kafka/blob/29f3260a9c07e654a28620aeb93a778622a5233d/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L269]. 
> From a > [discussion|https://lists.apache.org/thread/4895wrd1z92kjb708zck4s1f62xq6r8x] > on the mailing list, [~jolshan] pointed out that this scenario shouldn't be > possible as the following code paths should prevent version 4 > ADD_PARTITIONS_TO_TXN requests being sent to other brokers: > [https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/clients/src/main/java/org/apache/kafka/clients/NodeApiVersions.java#L130] > > [https://github.com/apache/kafka/blob/525b9b1d7682ae2a527ceca83fedca44b1cba11a/core/src/main/scala/kafka/server/AddPartitionsToTxnManager.scala#L195] > However, these requests are still sent to other brokers in > our environment. > On further inspection of the code, I am wondering if the following code path > could lead to this issue: > [https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L500] > In this scenario, we don't have any _NodeApiVersions_ available for the > specified nodeId and potentially skip the _latestUsableVersion_ check. I am wondering if it is possible that because > _discoverBrokerVersions_ is set to false for the network client of the > AddPartitionsToTxnManager, it skips fetching ApiVersions? I can see that > we create the network client here: > [https://github.com/apache/kafka/blob/c4deed513057c94eb502e64490d6bdc23551d8b6/core/src/main/scala/kafka/server/KafkaServer.scala#L641] > The _NetworkUtils.buildNetworkClient_ method seems to create a network client that > has _discoverBrokerVersions_ set to false. > I was hoping I could get some assistance debugging this issue. Happy to > provide any additional information needed. > > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16386) NETWORK_EXCEPTIONs from transaction verification are not translated
[ https://issues.apache.org/jira/browse/KAFKA-16386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan updated KAFKA-16386: --- Fix Version/s: 3.7.1 > NETWORK_EXCEPTIONs from transaction verification are not translated > --- > > Key: KAFKA-16386 > URL: https://issues.apache.org/jira/browse/KAFKA-16386 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.6.0, 3.7.0 >Reporter: Sean Quah >Priority: Minor > Fix For: 3.8.0, 3.7.1 > > > KAFKA-14402 > ([KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense]) > adds verification with the transaction coordinator on Produce and > TxnOffsetCommit paths as a defense against hanging transactions. For > compatibility with older clients, retriable errors from the verification step > are translated to ones already expected and handled by existing clients. When > verification was added, we forgot to translate {{NETWORK_EXCEPTION}} s. > [~dajac] noticed this manifesting as a test failure when > tests/kafkatest/tests/core/transactions_test.py was run with an older client > (prior to the fix for KAFKA-16122): > {quote} > {{NETWORK_EXCEPTION}} is indeed returned as a partition error. The > {{TransactionManager.TxnOffsetCommitHandler}} considers it as a fatal error > so it transitions to the fatal state. > It seems that there are two cases where the server could return it: (1) When > the verification request times out or its connections is cut; or (2) in > {{AddPartitionsToTxnManager.addTxnData}} where we say that we use it because > we want a retriable error. > {quote} > The first case was triggered as part of the test. The second case happens > when there is already a verification request ({{AddPartitionsToTxn}}) in > flight with the same epoch and we want clients to try again when we're not > busy. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16570) FenceProducers API returns "unexpected error" when successful
[ https://issues.apache.org/jira/browse/KAFKA-16570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838796#comment-17838796 ] Justine Olshan commented on KAFKA-16570: Hmmm – retries here are a bit strange: since the command was successful, we will just return this error until we can allocate the new producer ID. I suppose this "guarantees" the markers were written, but it is not consistent with how end txn usually works. See how the error is returned on success of the EndTxn API call? [https://github.com/apache/kafka/blob/53ff1a5a589eb0e30a302724fcf5d0c72c823682/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L175] > FenceProducers API returns "unexpected error" when successful > - > > Key: KAFKA-16570 > URL: https://issues.apache.org/jira/browse/KAFKA-16570 > Project: Kafka > Issue Type: Bug >Reporter: Justine Olshan >Assignee: Justine Olshan >Priority: Major > > When we want to fence a producer using the admin client, we send an > InitProducerId request. > There is logic in that API to fence (and abort) any ongoing transactions and > that is what the API relies on to fence the producer. However, this handling > also returns CONCURRENT_TRANSACTIONS. In normal usage, this is good because > we want to actually get a new producer ID and want to retry until the ID > is supplied or we time out. > [https://github.com/apache/kafka/blob/5193eb93237ba9093ae444d73a1eaa2d6abcc9c1/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L170] > > [https://github.com/apache/kafka/blob/a3dcbd4e28a35f79f75ec1bf316ef0b39c0df164/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L1322] > > In the case of fence producer, we don't retry and instead we have no handling > for concurrent transactions and log a message about an unexpected error. 
> [https://github.com/apache/kafka/blob/a3dcbd4e28a35f79f75ec1bf316ef0b39c0df164/clients/src/main/java/org/apache/kafka/clients/admin/internals/FenceProducersHandler.java#L112] > > This is not unexpected though and the operation was successful. We should > just swallow this error and treat this as a successful run of the command. -- This message was sent by Atlassian Jira (v8.20.10#820010)
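The proposed fix boils down to widening what counts as success on the fencing path. A hedged sketch follows; the enum and method are invented for illustration (real error codes live in org.apache.kafka.common.protocol.Errors), and whether the aborted-transaction markers are fully written at this point is the ticket's claim, not this sketch's:

```java
// Minimal sketch of the proposed handling: when fencing via InitProducerId,
// a CONCURRENT_TRANSACTIONS answer means the old transaction is already being
// aborted, so the fence took effect and should be reported as success instead
// of being logged as an "unexpected error".
public class FenceResultSketch {
    public enum Error { NONE, CONCURRENT_TRANSACTIONS, UNKNOWN_SERVER_ERROR }

    public static boolean fenceSucceeded(Error error) {
        // Swallow CONCURRENT_TRANSACTIONS rather than treating it as fatal.
        return error == Error.NONE || error == Error.CONCURRENT_TRANSACTIONS;
    }

    public static void main(String[] args) {
        System.out.println(fenceSucceeded(Error.CONCURRENT_TRANSACTIONS)); // true
        System.out.println(fenceSucceeded(Error.UNKNOWN_SERVER_ERROR));    // false
    }
}
```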
[jira] [Updated] (KAFKA-16386) NETWORK_EXCEPTIONs from transaction verification are not translated
[ https://issues.apache.org/jira/browse/KAFKA-16386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan updated KAFKA-16386: --- Affects Version/s: 3.7.0 > NETWORK_EXCEPTIONs from transaction verification are not translated > --- > > Key: KAFKA-16386 > URL: https://issues.apache.org/jira/browse/KAFKA-16386 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.6.0, 3.7.0 >Reporter: Sean Quah >Priority: Minor > Fix For: 3.8.0 > > > KAFKA-14402 > ([KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense]) > adds verification with the transaction coordinator on Produce and > TxnOffsetCommit paths as a defense against hanging transactions. For > compatibility with older clients, retriable errors from the verification step > are translated to ones already expected and handled by existing clients. When > verification was added, we forgot to translate {{NETWORK_EXCEPTION}} s. > [~dajac] noticed this manifesting as a test failure when > tests/kafkatest/tests/core/transactions_test.py was run with an older client > (prior to the fix for KAFKA-16122): > {quote} > {{NETWORK_EXCEPTION}} is indeed returned as a partition error. The > {{TransactionManager.TxnOffsetCommitHandler}} considers it as a fatal error > so it transitions to the fatal state. > It seems that there are two cases where the server could return it: (1) When > the verification request times out or its connections is cut; or (2) in > {{AddPartitionsToTxnManager.addTxnData}} where we say that we use it because > we want a retriable error. > {quote} > The first case was triggered as part of the test. The second case happens > when there is already a verification request ({{AddPartitionsToTxn}}) in > flight with the same epoch and we want clients to try again when we're not > busy. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16386) NETWORK_EXCEPTIONs from transaction verification are not translated
[ https://issues.apache.org/jira/browse/KAFKA-16386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838768#comment-17838768 ] Justine Olshan commented on KAFKA-16386: Note: For 3.6, this is only returned in the produce path which treats the exception as retriable. The TxnOffsetCommit path is the one that experiences the fatal error and is only in 3.7. > NETWORK_EXCEPTIONs from transaction verification are not translated > --- > > Key: KAFKA-16386 > URL: https://issues.apache.org/jira/browse/KAFKA-16386 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.6.0, 3.7.0 >Reporter: Sean Quah >Priority: Minor > Fix For: 3.8.0 > > > KAFKA-14402 > ([KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense]) > adds verification with the transaction coordinator on Produce and > TxnOffsetCommit paths as a defense against hanging transactions. For > compatibility with older clients, retriable errors from the verification step > are translated to ones already expected and handled by existing clients. When > verification was added, we forgot to translate {{NETWORK_EXCEPTION}} s. > [~dajac] noticed this manifesting as a test failure when > tests/kafkatest/tests/core/transactions_test.py was run with an older client > (prior to the fix for KAFKA-16122): > {quote} > {{NETWORK_EXCEPTION}} is indeed returned as a partition error. The > {{TransactionManager.TxnOffsetCommitHandler}} considers it as a fatal error > so it transitions to the fatal state. > It seems that there are two cases where the server could return it: (1) When > the verification request times out or its connections is cut; or (2) in > {{AddPartitionsToTxnManager.addTxnData}} where we say that we use it because > we want a retriable error. > {quote} > The first case was triggered as part of the test. 
The second case happens > when there is already a verification request ({{AddPartitionsToTxn}}) in > flight with the same epoch and we want clients to try again when we're not > busy. -- This message was sent by Atlassian Jira (v8.20.10#820010)
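The missed translation step described in this ticket can be sketched as follows. The mapping target chosen here (COORDINATOR_LOAD_IN_PROGRESS) is an assumption for illustration; the ticket itself only states that NETWORK_EXCEPTION was left untranslated, not which retriable code the fix maps it to:

```java
// Hedged sketch of the verification-error translation: older clients treat
// NETWORK_EXCEPTION from TxnOffsetCommit as fatal, so retriable verification
// failures must be rewritten to a code those clients already retry on.
// The target code below is illustrative, not necessarily the real fix's.
public class VerificationErrorSketch {
    public enum Error { NONE, NETWORK_EXCEPTION, NOT_ENOUGH_REPLICAS, COORDINATOR_LOAD_IN_PROGRESS }

    public static Error translateForClient(Error verificationError) {
        if (verificationError == Error.NETWORK_EXCEPTION) {
            // The translation KAFKA-16386 says was forgotten.
            return Error.COORDINATOR_LOAD_IN_PROGRESS;
        }
        return verificationError;
    }

    public static void main(String[] args) {
        System.out.println(translateForClient(Error.NETWORK_EXCEPTION));
        System.out.println(translateForClient(Error.NONE));
    }
}
```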
[jira] [Updated] (KAFKA-16570) FenceProducers API returns "unexpected error" when successful
[ https://issues.apache.org/jira/browse/KAFKA-16570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan updated KAFKA-16570: --- Description: When we want to fence a producer using the admin client, we send an InitProducerId request. There is logic in that API to fence (and abort) any ongoing transactions and that is what the API relies on to fence the producer. However, this handling also returns CONCURRENT_TRANSACTIONS. In normal usage, this is good because we want to actually get a new producer ID and want to retry until the ID is supplied or we time out. [https://github.com/apache/kafka/blob/5193eb93237ba9093ae444d73a1eaa2d6abcc9c1/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L170] [https://github.com/apache/kafka/blob/a3dcbd4e28a35f79f75ec1bf316ef0b39c0df164/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L1322] In the case of fence producer, we don't retry and instead we have no handling for concurrent transactions and log a message about an unexpected error. [https://github.com/apache/kafka/blob/a3dcbd4e28a35f79f75ec1bf316ef0b39c0df164/clients/src/main/java/org/apache/kafka/clients/admin/internals/FenceProducersHandler.java#L112] This is not unexpected though and the operation was successful. We should just swallow this error and treat this as a successful run of the command. was: When we want to fence a producer using the admin client, we send an InitProducerId request. There is logic in that API to fence (and abort) any ongoing transactions and that is what the API relies on to fence the producer. However, this handling also returns CONCURRENT_TRANSACTIONS. In normal usage, this is good because we want to actually get a new producer ID and want to retry until the ID is supplied or we time out. 
[https://github.com/apache/kafka/blob/5193eb93237ba9093ae444d73a1eaa2d6abcc9c1/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L170] [https://github.com/confluentinc/ce-kafka/blob/b626db8bd94fe971adef3551518761a7be7de454/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L1322] In the case of fence producer, we don't retry and instead we have no handling for concurrent transactions and log a message about an unexpected error. [https://github.com/confluentinc/ce-kafka/blob/b626db8bd94fe971adef3551518761a7be7de454/clients/src/main/java/org/apache/kafka/clients/admin/internals/FenceProducersHandler.java#L112] This is not unexpected though and the operation was successful. We should just swallow this error and treat this as a successful run of the command. > FenceProducers API returns "unexpected error" when successful > - > > Key: KAFKA-16570 > URL: https://issues.apache.org/jira/browse/KAFKA-16570 > Project: Kafka > Issue Type: Bug >Reporter: Justine Olshan >Assignee: Justine Olshan >Priority: Major > > When we want to fence a producer using the admin client, we send an > InitProducerId request. > There is logic in that API to fence (and abort) any ongoing transactions and > that is what the API relies on to fence the producer. However, this handling > also returns CONCURRENT_TRANSACTIONS. In normal usage, this is good because > we want to actually get a new producer ID and want to retry until the the ID > is supplied or we time out. 
> [https://github.com/apache/kafka/blob/5193eb93237ba9093ae444d73a1eaa2d6abcc9c1/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L170] > > [https://github.com/apache/kafka/blob/a3dcbd4e28a35f79f75ec1bf316ef0b39c0df164/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L1322] > > In the case of fence producer, we don't retry and instead we have no handling > for concurrent transactions and log a message about an unexpected error. > [https://github.com/apache/kafka/blob/a3dcbd4e28a35f79f75ec1bf316ef0b39c0df164/clients/src/main/java/org/apache/kafka/clients/admin/internals/FenceProducersHandler.java#L112] > > This is not unexpected though and the operation was successful. We should > just swallow this error and treat this as a successful run of the command. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-16571) reassign_partitions_test.bounce_brokers should wait for messages to be sent to every partition
[ https://issues.apache.org/jira/browse/KAFKA-16571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan reassigned KAFKA-16571: -- Assignee: Justine Olshan > reassign_partitions_test.bounce_brokers should wait for messages to be sent > to every partition > -- > > Key: KAFKA-16571 > URL: https://issues.apache.org/jira/browse/KAFKA-16571 > Project: Kafka > Issue Type: Bug >Reporter: David Mao >Assignee: Justine Olshan >Priority: Major > > This particular system test tries to bounce brokers while produce is ongoing. > The test also has rf=3 and min.isr=3 configured, so if any brokers are > bounced before records are produced to every partition, it is possible to run > into OutOfOrderSequence exceptions similar to what is described in > https://issues.apache.org/jira/browse/KAFKA-14359 > When running the produce_consume_validate for the reassign_partitions_test, > instead of waiting for 5 acked messages, we should wait for messages to be > acked on the full set of partitions. -- This message was sent by Atlassian Jira (v8.20.10#820010)
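The proposed test change, waiting until every partition has at least one acked message before bouncing brokers, can be sketched as a small helper. This is a hypothetical Python helper in the spirit of the ducktape system test, not the actual test code; the `acked_offsets` and `poll` names are assumptions.

```python
import time

def wait_for_acks_on_all_partitions(acked_offsets, num_partitions,
                                    poll=lambda: None, timeout_s=60.0):
    """Wait until at least one message is acked on every partition.

    acked_offsets: dict mapping partition -> highest acked offset, updated
    by the producer's ack callback; poll() drives the producer between
    checks. Returns True once all partitions have an acked message, or
    False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        poll()
        if all(p in acked_offsets for p in range(num_partitions)):
            return True
        time.sleep(0.0)  # yield; a real test would sleep a bit longer
    return False
```

Compared with waiting for a fixed count of acked messages, this guarantees every partition has state before the first broker bounce, which avoids the empty-state OutOfOrderSequence path described above.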
[jira] [Updated] (KAFKA-16570) FenceProducers API returns "unexpected error" when successful
[ https://issues.apache.org/jira/browse/KAFKA-16570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan updated KAFKA-16570: --- Description: When we want to fence a producer using the admin client, we send an InitProducerId request. There is logic in that API to fence (and abort) any ongoing transactions, and that is what the API relies on to fence the producer. However, this handling also returns CONCURRENT_TRANSACTIONS. In normal usage, this is good because we want to actually get a new producer ID and want to retry until the ID is supplied or we time out. [https://github.com/apache/kafka/blob/5193eb93237ba9093ae444d73a1eaa2d6abcc9c1/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L170] [https://github.com/confluentinc/ce-kafka/blob/b626db8bd94fe971adef3551518761a7be7de454/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L1322] In the case of the fence producer, we don't retry; instead we have no handling for concurrent transactions, and we log a message about an unexpected error. [https://github.com/confluentinc/ce-kafka/blob/b626db8bd94fe971adef3551518761a7be7de454/clients/src/main/java/org/apache/kafka/clients/admin/internals/FenceProducersHandler.java#L112] This is not unexpected, though, and the operation was successful. We should just swallow this error and treat it as a successful run of the command. was: When we want to fence a producer using the admin client, we send an InitProducerId request. There is logic in that API to fence (and abort) any ongoing transactions, and that is what the API relies on to fence the producer. However, this handling also returns CONCURRENT_TRANSACTIONS. In normal usage, this is good because we want to actually get a new producer ID and want to retry until the ID is supplied or we time out. [https://github.com/apache/kafka/blob/5193eb93237ba9093ae444d73a1eaa2d6abcc9c1/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L170] In the case of the fence producer, we don't retry; instead we have no handling for concurrent transactions, and we log a message about an unexpected error. [https://github.com/confluentinc/ce-kafka/blob/b626db8bd94fe971adef3551518761a7be7de454/clients/src/main/java/org/apache/kafka/clients/admin/internals/FenceProducersHandler.java#L112] This is not unexpected, though, and the operation was successful. We should just swallow this error and treat it as a successful run of the command. > FenceProducers API returns "unexpected error" when successful > - > > Key: KAFKA-16570 > URL: https://issues.apache.org/jira/browse/KAFKA-16570 > Project: Kafka > Issue Type: Bug >Reporter: Justine Olshan >Assignee: Justine Olshan >Priority: Major -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16570) FenceProducers API returns "unexpected error" when successful
[ https://issues.apache.org/jira/browse/KAFKA-16570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837902#comment-17837902 ] Justine Olshan commented on KAFKA-16570: cc: [~ChrisEgerton] [~showuon] [~tombentley] as author and reviewers of the original change. > FenceProducers API returns "unexpected error" when successful > - > > Key: KAFKA-16570 > URL: https://issues.apache.org/jira/browse/KAFKA-16570 > Project: Kafka > Issue Type: Bug >Reporter: Justine Olshan >Assignee: Justine Olshan >Priority: Major -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16570) FenceProducers API returns "unexpected error" when successful
Justine Olshan created KAFKA-16570: -- Summary: FenceProducers API returns "unexpected error" when successful Key: KAFKA-16570 URL: https://issues.apache.org/jira/browse/KAFKA-16570 Project: Kafka Issue Type: Bug Reporter: Justine Olshan Assignee: Justine Olshan When we want to fence a producer using the admin client, we send an InitProducerId request. There is logic in that API to fence (and abort) any ongoing transactions, and that is what the API relies on to fence the producer. However, this handling also returns CONCURRENT_TRANSACTIONS. In normal usage, this is good because we want to actually get a new producer ID and want to retry until the ID is supplied or we time out. [https://github.com/apache/kafka/blob/5193eb93237ba9093ae444d73a1eaa2d6abcc9c1/core/src/main/scala/kafka/coordinator/transaction/TransactionCoordinator.scala#L170] In the case of the fence producer, we don't retry; instead we have no handling for concurrent transactions, and we log a message about an unexpected error. [https://github.com/confluentinc/ce-kafka/blob/b626db8bd94fe971adef3551518761a7be7de454/clients/src/main/java/org/apache/kafka/clients/admin/internals/FenceProducersHandler.java#L112] This is not unexpected, though, and the operation was successful. We should just swallow this error and treat it as a successful run of the command. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-16451) testDeltaFollower tests failing in ReplicaManager
[ https://issues.apache.org/jira/browse/KAFKA-16451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-16451. Resolution: Duplicate > testDeltaFollower tests failing in ReplicaManager > - > > Key: KAFKA-16451 > URL: https://issues.apache.org/jira/browse/KAFKA-16451 > Project: Kafka > Issue Type: Bug >Reporter: Justine Olshan >Priority: Major > > many ReplicaManagerTests with the prefix testDeltaFollower seem to be > failing. A few other ReplicaManager tests as well. See existing failures in > [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2765/tests] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16451) testDeltaFollower tests failing in ReplicaManager
[ https://issues.apache.org/jira/browse/KAFKA-16451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17832330#comment-17832330 ] Justine Olshan commented on KAFKA-16451: closing as it is a duplicate of https://issues.apache.org/jira/browse/KAFKA-16447 > testDeltaFollower tests failing in ReplicaManager > - > > Key: KAFKA-16451 > URL: https://issues.apache.org/jira/browse/KAFKA-16451 > Project: Kafka > Issue Type: Bug >Reporter: Justine Olshan >Priority: Major > > many ReplicaManagerTests with the prefix testDeltaFollower seem to be > failing. A few other ReplicaManager tests as well. See existing failures in > [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2765/tests] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16451) testDeltaFollower tests failing in ReplicaManager
Justine Olshan created KAFKA-16451: -- Summary: testDeltaFollower tests failing in ReplicaManager Key: KAFKA-16451 URL: https://issues.apache.org/jira/browse/KAFKA-16451 Project: Kafka Issue Type: Bug Reporter: Justine Olshan Many ReplicaManagerTest methods with the prefix testDeltaFollower seem to be failing, along with a few other ReplicaManager tests. See existing failures in [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2765/tests] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-14089) Flaky ExactlyOnceSourceIntegrationTest.testSeparateOffsetsTopic
[ https://issues.apache.org/jira/browse/KAFKA-14089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830728#comment-17830728 ] Justine Olshan commented on KAFKA-14089: I've seen this again as well. https://ge.apache.org/scans/tests?search.names=Git%20branch=P28D=kafka=America%2FLos_Angeles=trunk=org.apache.kafka.connect.integration.ExactlyOnceSourceIntegrationTest=testSeparateOffsetsTopic > Flaky ExactlyOnceSourceIntegrationTest.testSeparateOffsetsTopic > --- > > Key: KAFKA-14089 > URL: https://issues.apache.org/jira/browse/KAFKA-14089 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Mickael Maison >Priority: Major > Fix For: 3.3.0 > > Attachments: failure.txt, > org.apache.kafka.connect.integration.ExactlyOnceSourceIntegrationTest.testSeparateOffsetsTopic.test.stdout > > > It looks like the sequence got broken around "65535, 65537, 65536, 65539, > 65538, 65541, 65540, 65543" -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-15914) Flaky test - testAlterSinkConnectorOffsetsOverriddenConsumerGroupId - OffsetsApiIntegrationTest
[ https://issues.apache.org/jira/browse/KAFKA-15914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830727#comment-17830727 ] Justine Olshan commented on KAFKA-15914: I'm seeing this as well. https://ge.apache.org/scans/tests?search.names=Git%20branch=P28D=kafka=America%2FLos_Angeles=trunk=org.apache.kafka.connect.integration.OffsetsApiIntegrationTest=testAlterSinkConnectorOffsetsOverriddenConsumerGroupId > Flaky test - testAlterSinkConnectorOffsetsOverriddenConsumerGroupId - > OffsetsApiIntegrationTest > --- > > Key: KAFKA-15914 > URL: https://issues.apache.org/jira/browse/KAFKA-15914 > Project: Kafka > Issue Type: Bug > Components: connect >Reporter: Lucas Brutschy >Priority: Major > Labels: flaky-test > > Test intermittently gives the following result: > {code} > org.opentest4j.AssertionFailedError: Condition not met within timeout 3. > Sink connector consumer group offsets should catch up to the topic end > offsets ==> expected: but was: > at > org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151) > at > org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132) > at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63) > at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) > at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:210) > at > org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:331) > at > org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:379) > at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:328) > at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:312) > at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:302) > at > org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.verifyExpectedSinkConnectorOffsets(OffsetsApiIntegrationTest.java:917) > at > 
org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.alterAndVerifySinkConnectorOffsets(OffsetsApiIntegrationTest.java:396) > at > org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.testAlterSinkConnectorOffsetsOverriddenConsumerGroupId(OffsetsApiIntegrationTest.java:297) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-14359) Idempotent Producer continues to retry on OutOfOrderSequence error when first batch fails
[ https://issues.apache.org/jira/browse/KAFKA-14359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan reassigned KAFKA-14359: -- Assignee: Justine Olshan > Idempotent Producer continues to retry on OutOfOrderSequence error when first > batch fails > - > > Key: KAFKA-14359 > URL: https://issues.apache.org/jira/browse/KAFKA-14359 > Project: Kafka > Issue Type: Task >Reporter: Justine Olshan >Assignee: Justine Olshan >Priority: Major > > When the idempotent producer does not have any state, it can get stuck > retrying an out-of-order sequence. Consider the > following scenario where an idempotent producer's retries and delivery > timeout are both set to int max (a configuration used in Streams). > 1. A producer sends out several batches (up to 5), with the first one starting > at sequence 0. > 2. The first batch with sequence 0 fails due to a transient error (i.e., > NOT_LEADER_OR_FOLLOWER or a timeout error). > 3. A second batch, say with sequence 200, comes in. Since there is no > previous state to invalidate it, it gets written to the log. > 4. The original batch is retried and will get an out-of-order sequence error. > 5. The current Java client will continue to retry this batch, but it will never > resolve. -- This message was sent by Atlassian Jira (v8.20.10#820010)
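The five-step scenario above can be reproduced with a toy model of the broker's per-producer sequence check. This is an illustrative sketch only: the real broker validates whole batch ranges, handles sequence wraparound, and keeps richer producer state.

```python
class PartitionState:
    """Toy per-(partition, producer-id) sequence state on the leader."""

    def __init__(self):
        self.last_seq = None  # no state yet for this producer id

    def append(self, first_seq):
        if self.last_seq is None:
            # Nothing to validate against: any starting sequence is accepted.
            self.last_seq = first_seq
            return "OK"
        if first_seq == self.last_seq + 1:
            self.last_seq = first_seq
            return "OK"
        # Failed appends do not advance state, so a stale retry stays stale.
        return "OUT_OF_ORDER_SEQUENCE"

state = PartitionState()
# The batch at sequence 0 failed transiently and will be retried later;
# meanwhile the batch at sequence 200 arrives with no state to reject it.
assert state.append(200) == "OK"                    # step 3: accepted blindly
assert state.append(0) == "OUT_OF_ORDER_SEQUENCE"   # step 4: retry is invalid
assert state.append(0) == "OUT_OF_ORDER_SEQUENCE"   # step 5: never resolves
```

The key point the toy captures: once the later batch establishes state, the retried first batch can never become valid, so infinite retries spin forever.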
[jira] [Commented] (KAFKA-16360) Release plan of 3.x kafka releases.
[ https://issues.apache.org/jira/browse/KAFKA-16360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825386#comment-17825386 ] Justine Olshan commented on KAFKA-16360: Hey there – there was some discussion on the mailing list. 3.8 should be the last release. See here: [https://lists.apache.org/thread/kvdp2gmq5gd9txkvxh5vk3z2n55b04s5] There is also a KIP. KIP-1012: [https://cwiki.apache.org/confluence/display/KAFKA/KIP-1012%3A+The+need+for+a+Kafka+3.8.x+release] Hope this clears things up. > Release plan of 3.x kafka releases. > --- > > Key: KAFKA-16360 > URL: https://issues.apache.org/jira/browse/KAFKA-16360 > Project: Kafka > Issue Type: Improvement >Reporter: kaushik srinivas >Priority: Major > > KIP > [https://cwiki.apache.org/confluence/display/KAFKA/KIP-833%3A+Mark+KRaft+as+Production+Ready#KIP833:MarkKRaftasProductionReady-ReleaseTimeline] > mentions: > h2. Kafka 3.7 > * January 2024 > * Final release with ZK mode > But we see in Jira that some tickets are marked for the 3.8 release. Does Apache > continue to make 3.x releases supporting both ZooKeeper and KRaft, > independent of the pure-KRaft 4.x releases? > If yes, how many more releases can be expected on the 3.x release line? > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16352) Transaction may get stuck in PrepareCommit or PrepareAbort state
[ https://issues.apache.org/jira/browse/KAFKA-16352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824512#comment-17824512 ] Justine Olshan commented on KAFKA-16352: Thanks for filing [~alivshits]. If I understand correctly, the transaction is completed on the data partition side (in terms of writing the marker, LSO, etc.) but the coordinator is not able to complete the final book-keeping, so we are unable to continue using the transactional id. > Transaction may get stuck in PrepareCommit or PrepareAbort state > > > Key: KAFKA-16352 > URL: https://issues.apache.org/jira/browse/KAFKA-16352 > Project: Kafka > Issue Type: Bug > Components: core >Reporter: Artem Livshits >Assignee: Artem Livshits >Priority: Major > > A transaction took a long time to complete; trying to restart a producer > would lead to CONCURRENT_TRANSACTION errors. Investigation has shown that > the transaction was stuck in PrepareCommit for a few days: > (current time when the investigation happened: Feb 27 2024), transaction > state: > {{Type |Name |Value}} > {{-}} > {{ref |transactionalId |xxx-yyy}} > {{long |producerId |299364}} > {{ref |state |kafka.coordinator.transaction.PrepareCommit$ > @ 0x44fe22760}} > {{long |txnStartTimestamp |1708619624810 Thu Feb 22 2024 16:33:44.810 > GMT+}} > {{long |txnLastUpdateTimestamp|1708619625335 Thu Feb 22 2024 16:33:45.335 > GMT+}} > {{-}} > The partition list was empty and transactionsWithPendingMarkers didn't > contain the reference to the transactional state. In the log there were the > following relevant messages: > {{22 Feb 2024 @ 16:33:45.623 UTC [Transaction State Manager 1]: Completed > loading transaction metadata from __transaction_state-3 for coordinator epoch > 611}} > (this is the partition that contains the transactional id). After the data > is loaded, it sends out markers, etc. 
> Then there is this message: > {{22 Feb 2024 @ 16:33:45.696 UTC [Transaction Marker Request Completion > Handler 4]: Transaction coordinator epoch for xxx-yyy has changed from 610 to > 611; cancel sending transaction markers TxnMarkerEntry\{producerId=299364, > producerEpoch=1005, coordinatorEpoch=610, result=COMMIT, > partitions=[foo-bar]} to the brokers}} > This message is logged just before the state is removed from > transactionsWithPendingMarkers, but the state apparently contained the entry > that was created by the load operation. So the sequence of events probably > looked like the following: > # partition load completed > # commit markers were sent for transactional id xxx-yyy; entry in > transactionsWithPendingMarkers was created > # zombie reply from the previous epoch completed, removed entry from > transactionsWithPendingMarkers > # commit markers properly completed, but couldn't transition to > CommitComplete state because transactionsWithPendingMarkers didn't have the > proper entry, so it got stuck there until the broker was restarted > Looking at the code, there are a few cases that could lead to similar race > conditions. The fix is to keep track of the PendingCompleteTxn value that > was used when sending the marker, so that we can only remove the state that > was created when the marker was sent and not accidentally remove the state > someone else created. -- This message was sent by Atlassian Jira (v8.20.10#820010)
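The "keep track of the PendingCompleteTxn value" fix amounts to a compare-and-remove on the pending-marker map: a completion may only delete the entry it created, so a zombie completion from a previous coordinator epoch cannot delete someone else's state. This is a Python illustration of the pattern, not the Scala coordinator code; Python's dict has no atomic `remove(key, value)`, so we check-and-delete under a lock, whereas Java's `ConcurrentHashMap.remove(key, value)` does this atomically.

```python
import threading

class PendingMarkers:
    """Toy version of transactionsWithPendingMarkers with owned removal."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}  # transactional_id -> pending-complete token

    def mark_sent(self, txn_id, token):
        """Record the token created when these markers were sent."""
        with self._lock:
            self._pending[txn_id] = token

    def complete(self, txn_id, token):
        """Remove the entry only if `token` is the one that created it."""
        with self._lock:
            if self._pending.get(txn_id) is token:
                del self._pending[txn_id]
                return True
            return False  # stale/zombie completion: leave the state alone
```

With this pattern, the zombie reply in step 3 above returns False and leaves the entry in place, so the real completion in step 4 can still transition the transaction out of PrepareCommit.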
[jira] [Created] (KAFKA-16308) Formatting and Updating Kafka Features
Justine Olshan created KAFKA-16308: -- Summary: Formatting and Updating Kafka Features Key: KAFKA-16308 URL: https://issues.apache.org/jira/browse/KAFKA-16308 Project: Kafka Issue Type: Task Reporter: Justine Olshan Assignee: Justine Olshan As part of KIP-1022, we need to extend the storage and upgrade tools to support features other than metadata version. See https://cwiki.apache.org/confluence/display/KAFKA/KIP-1023%3A+Formatting+and+Updating+Features -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16282) Allow to get last stable offset (LSO) in kafka-get-offsets.sh
[ https://issues.apache.org/jira/browse/KAFKA-16282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820821#comment-17820821 ] Justine Olshan commented on KAFKA-16282: Thanks [~ahmedsobeh]. A few things: For the proposed Admin.java change, pick one of the options and list the other as a rejected alternative. We may go with one or the other, but it is good to take a position for folks to agree or disagree with. For the compatibility section, we could have compatibility issues if someone uses this new field on a broker that doesn't support it. We should probably throw an error in that case. Generally, though, I think the KIP is ready for discussion; I may have further comments for you on the mailing list :) > Allow to get last stable offset (LSO) in kafka-get-offsets.sh > - > > Key: KAFKA-16282 > URL: https://issues.apache.org/jira/browse/KAFKA-16282 > Project: Kafka > Issue Type: Improvement >Reporter: Luke Chen >Assignee: Ahmed Sobeh >Priority: Major > Labels: need-kip, newbie, newbie++ > > Currently, when using `kafka-get-offsets.sh` to get the offset by time, we > have these choices: > {code:java} > --time / timestamp of the offsets before > that. > -1 or latest / [Note: No offset is returned, if > the > -2 or earliest / timestamp greater than recently > -3 or max-timestamp /committed record timestamp is > -4 or earliest-local / given.] (default: latest) > -5 or latest-tiered > {code} > For the "latest" option, it'll always return the "high watermark" because we > always send with the default option: *IsolationLevel.READ_UNCOMMITTED*. It > would be good if the command could support getting the last stable offset (LSO) > for transaction support. That is, sending the option with > *IsolationLevel.READ_COMMITTED* -- This message was sent by Atlassian Jira (v8.20.10#820010)
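The difference between the two isolation levels can be modelled in a few lines. This is an illustrative sketch of the semantics only, not the kafka-get-offsets.sh implementation: under READ_UNCOMMITTED the "latest" offset is the high watermark, while under READ_COMMITTED it is the last stable offset (LSO), which stops before the first still-open transaction.

```python
def latest_offset(high_watermark, last_stable_offset, isolation_level):
    """Model which 'latest' offset a list-offsets request would return."""
    if isolation_level == "read_committed":
        # LSO <= HW: offsets past the LSO may belong to open transactions,
        # so a read_committed consumer cannot see them yet.
        return min(last_stable_offset, high_watermark)
    return high_watermark

# A partition with an open transaction: HW = 120, LSO = 100.
assert latest_offset(120, 100, "read_uncommitted") == 120
assert latest_offset(120, 100, "read_committed") == 100
```

This is why the tool currently always reports the high watermark: it never sends READ_COMMITTED, so the LSO path above is unreachable from the CLI today.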
[jira] [Resolved] (KAFKA-16302) Builds failing due to streams test execution failures
[ https://issues.apache.org/jira/browse/KAFKA-16302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-16302. Resolution: Fixed > Builds failing due to streams test execution failures > - > > Key: KAFKA-16302 > URL: https://issues.apache.org/jira/browse/KAFKA-16302 > Project: Kafka > Issue Type: Task > Components: streams, unit tests >Reporter: Justine Olshan >Assignee: Justine Olshan >Priority: Major > > I'm seeing this on master and many PR builds for all versions: > > {code:java} > [2024-02-22T14:37:07.076Z] * What went wrong: > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1426[2024-02-22T14:37:07.076Z] > Execution failed for task ':streams:test'. > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1427[2024-02-22T14:37:07.076Z] > > The following test methods could not be retried, which is unexpected. > Please file a bug report at > https://github.com/gradle/test-retry-gradle-plugin/issues > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1428[2024-02-22T14:37:07.076Z] > > org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@78d39a69] > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1429[2024-02-22T14:37:07.076Z] > > org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@3c818ac4] > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1430[2024-02-22T14:37:07.076Z] > > 
org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@251f7d26] > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1431[2024-02-22T14:37:07.076Z] > > org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@52c8295b] > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1432[2024-02-22T14:37:07.076Z] > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16302) Builds failing due to streams test execution failures
[ https://issues.apache.org/jira/browse/KAFKA-16302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819835#comment-17819835 ] Justine Olshan commented on KAFKA-16302: Root cause is [https://github.com/apache/kafka/pull/15396] removing the log line needed for these tests to pass {code:java} java.lang.AssertionError: Expected: a collection containing "Skipping record for expired segment." but: was empty{code} > Builds failing due to streams test execution failures > - > > Key: KAFKA-16302 > URL: https://issues.apache.org/jira/browse/KAFKA-16302 > Project: Kafka > Issue Type: Task >Reporter: Justine Olshan >Priority: Major > > I'm seeing this on master and many PR builds for all versions: > > {code:java} > [2024-02-22T14:37:07.076Z] * What went wrong: > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1426[2024-02-22T14:37:07.076Z] > Execution failed for task ':streams:test'. > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1427[2024-02-22T14:37:07.076Z] > > The following test methods could not be retried, which is unexpected. 
> Please file a bug report at > https://github.com/gradle/test-retry-gradle-plugin/issues > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1428[2024-02-22T14:37:07.076Z] > > org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@78d39a69] > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1429[2024-02-22T14:37:07.076Z] > > org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@3c818ac4] > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1430[2024-02-22T14:37:07.076Z] > > org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@251f7d26] > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1431[2024-02-22T14:37:07.076Z] > > org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@52c8295b] > https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1432[2024-02-22T14:37:07.076Z] > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
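The failure mode is easy to reproduce in miniature: a test that asserts on an emitted log message fails with an empty collection as soon as the producing code stops logging it. Below is a generic Python analogue of the broken streams tests, not the actual JUnit code; the function and logger names are hypothetical.

```python
import logging

def put_expired_record(log, record_ts, observed_stream_time, retention_ms):
    """Drop a record whose timestamp falls outside retention, logging why."""
    if observed_stream_time - record_ts > retention_ms:
        # This is the line KAFKA-16302's root-cause PR removed; the tests
        # assert on exactly this message.
        log.warning("Skipping record for expired segment.")
        return False
    return True

# Capture emitted log messages, the way LogCaptureAppender does in the tests.
logger = logging.getLogger("segmented-store")
captured = []
handler = logging.Handler()
handler.emit = lambda rec: captured.append(rec.getMessage())
logger.addHandler(handler)

assert put_expired_record(logger, record_ts=0, observed_stream_time=10_000,
                          retention_ms=1_000) is False
# The test passes only while the log line is still emitted; remove the
# log.warning call above and this assertion fails with an empty collection.
assert "Skipping record for expired segment." in captured
```

This mirrors the reported error exactly: "Expected: a collection containing 'Skipping record for expired segment.' but: was empty".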
[jira] [Updated] (KAFKA-16302) Builds failing due to streams test execution failures
[ https://issues.apache.org/jira/browse/KAFKA-16302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan updated KAFKA-16302: --- Description: I'm seeing this on master and many PR builds for all versions: {code:java} [2024-02-22T14:37:07.076Z] * What went wrong: https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1426[2024-02-22T14:37:07.076Z] Execution failed for task ':streams:test'. https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1427[2024-02-22T14:37:07.076Z] > The following test methods could not be retried, which is unexpected. Please file a bug report at https://github.com/gradle/test-retry-gradle-plugin/issues https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1428[2024-02-22T14:37:07.076Z] org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@78d39a69] https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1429[2024-02-22T14:37:07.076Z] org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@3c818ac4] https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1430[2024-02-22T14:37:07.076Z] org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@251f7d26] https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1431[2024-02-22T14:37:07.076Z] 
org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@52c8295b] https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1432[2024-02-22T14:37:07.076Z] {code} was: I'm seeing this on master and many PR builds for all versions: ``` [2024-02-22T14:37:07.076Z] * What went wrong: [|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1426][2024-02-22T14:37:07.076Z] Execution failed for task ':streams:test'. [|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1427][2024-02-22T14:37:07.076Z] > The following test methods could not be retried, which is unexpected. Please file a bug report at [https://github.com/gradle/test-retry-gradle-plugin/issues] [|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1428][2024-02-22T14:37:07.076Z] org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@78d39a69] [|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1429][2024-02-22T14:37:07.076Z] org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@3c818ac4] [|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1430][2024-02-22T14:37:07.076Z] org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@251f7d26] 
[|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1431][2024-02-22T14:37:07.076Z] org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@52c8295b] ``` [|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1432][2024-02-22T14:37:07.076Z] > Builds failing due to streams test execution failures > - > > Key: KAFKA-16302 > URL: https://issues.apache.org/jira/browse/KAFKA-16302 > Project: Kafka > Issue Type: Task >Reporter: Justine Olshan >Priority: Major > > I'm seeing this on master and many PR builds for all versions: > > {code:java} > [2024-02-22T14:37:07.076Z] * What went wrong: >
[jira] [Created] (KAFKA-16302) Builds failing due to streams test execution failures
Justine Olshan created KAFKA-16302: -- Summary: Builds failing due to streams test execution failures Key: KAFKA-16302 URL: https://issues.apache.org/jira/browse/KAFKA-16302 Project: Kafka Issue Type: Task Reporter: Justine Olshan I'm seeing this on master and many PR builds for all versions: ``` [2024-02-22T14:37:07.076Z] * What went wrong: [|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1426][2024-02-22T14:37:07.076Z] Execution failed for task ':streams:test'. [|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1427][2024-02-22T14:37:07.076Z] > The following test methods could not be retried, which is unexpected. Please file a bug report at [https://github.com/gradle/test-retry-gradle-plugin/issues] [|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1428][2024-02-22T14:37:07.076Z] org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@78d39a69] [|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1429][2024-02-22T14:37:07.076Z] org.apache.kafka.streams.state.internals.RocksDBSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@3c818ac4] [|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1430][2024-02-22T14:37:07.076Z] org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.WindowKeySchema@251f7d26] [|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1431][2024-02-22T14:37:07.076Z] 
org.apache.kafka.streams.state.internals.RocksDBTimestampedSegmentedBytesStoreTest#shouldLogAndMeasureExpiredRecords[org.apache.kafka.streams.state.internals.SessionKeySchema@52c8295b] ``` [|https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15417/1/pipeline#step-89-log-1432][2024-02-22T14:37:07.076Z] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-15665) Enforce ISR to have all target replicas when complete partition reassignment
[ https://issues.apache.org/jira/browse/KAFKA-15665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-15665. Resolution: Fixed > Enforce ISR to have all target replicas when complete partition reassignment > > > Key: KAFKA-15665 > URL: https://issues.apache.org/jira/browse/KAFKA-15665 > Project: Kafka > Issue Type: Sub-task >Reporter: Calvin Liu >Assignee: Calvin Liu >Priority: Major > > Current partition reassignment can be completed when the new ISR is under min > ISR. We should fix this behavior. -- This message was sent by Atlassian Jira (v8.20.10#820010)
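The completion condition described in KAFKA-15665 can be sketched as a small predicate. This is an illustrative model only, assuming hypothetical names; it is not the actual controller API, but it captures the fixed rule: a reassignment may only complete once every target replica is in the ISR, which also prevents completing with an ISR under min.insync.replicas.

```java
import java.util.Set;

// Hypothetical helper modeling the check from KAFKA-15665. Names are
// illustrative; the real logic lives in the controller's reassignment code.
public class ReassignmentCheck {
    static boolean canCompleteReassignment(Set<Integer> isr, Set<Integer> targetReplicas, int minIsr) {
        // Buggy behavior effectively allowed completion with isr.size() < minIsr.
        // Fixed behavior: require every target replica to be in sync.
        return isr.containsAll(targetReplicas) && isr.size() >= minIsr;
    }

    public static void main(String[] args) {
        Set<Integer> target = Set.of(1, 2, 3);
        System.out.println(canCompleteReassignment(Set.of(1), target, 2));       // false: under min ISR
        System.out.println(canCompleteReassignment(Set.of(1, 2), target, 2));    // false: replica 3 not caught up
        System.out.println(canCompleteReassignment(Set.of(1, 2, 3), target, 2)); // true
    }
}
```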
[jira] [Comment Edited] (KAFKA-15898) Flaky test: testFenceMultipleBrokers() – org.apache.kafka.controller.QuorumControllerTest
[ https://issues.apache.org/jira/browse/KAFKA-15898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819351#comment-17819351 ] Justine Olshan edited comment on KAFKA-15898 at 2/21/24 6:26 PM: - I have 4 failures on one run [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15384/8/tests] was (Author: jolshan): I have 8 failures on one run https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15384/8/tests > Flaky test: testFenceMultipleBrokers() – > org.apache.kafka.controller.QuorumControllerTest > - > > Key: KAFKA-15898 > URL: https://issues.apache.org/jira/browse/KAFKA-15898 > Project: Kafka > Issue Type: Bug >Reporter: Apoorv Mittal >Priority: Major > Labels: flaky-test > > Build run: > [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14699/21/tests/] > > {code:java} > java.util.concurrent.TimeoutException: testFenceMultipleBrokers() timed out > after 40 secondsStacktracejava.util.concurrent.TimeoutException: > testFenceMultipleBrokers() timed out after 40 secondsat > org.junit.jupiter.engine.extension.TimeoutExceptionFactory.create(TimeoutExceptionFactory.java:29) >at > org.junit.jupiter.engine.extension.SameThreadTimeoutInvocation.proceed(SameThreadTimeoutInvocation.java:58) > at > org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:156) > at > org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:147) >at > org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:86) > at > org.junit.jupiter.engine.execution.InterceptingExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(InterceptingExecutableInvoker.java:103) > at > org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.lambda$invoke$0(InterceptingExecutableInvoker.java:93) > at > 
org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37) > at > org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:92) >at > org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:86) >at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$7(TestMethodTestDescriptor.java:218) > at > org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:214) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:139) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:69) >at > org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:151) >at > org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73) > at > org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141) >at > org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137) > at > org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139) >at > org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73) > at > 
org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138) > at > org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95) > at java.base/java.util.ArrayList.forEach(ArrayList.java:1511) at > org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:41) > at > org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:155) >at > org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73) > at >
[jira] [Commented] (KAFKA-15898) Flaky test: testFenceMultipleBrokers() – org.apache.kafka.controller.QuorumControllerTest
[ https://issues.apache.org/jira/browse/KAFKA-15898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819351#comment-17819351 ] Justine Olshan commented on KAFKA-15898: I have 8 failures on one run https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15384/8/tests > Flaky test: testFenceMultipleBrokers() – > org.apache.kafka.controller.QuorumControllerTest > - > > Key: KAFKA-15898 > URL: https://issues.apache.org/jira/browse/KAFKA-15898 > Project: Kafka > Issue Type: Bug >Reporter: Apoorv Mittal >Priority: Major > Labels: flaky-test > > Build run: > [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-14699/21/tests/] > > {code:java} > java.util.concurrent.TimeoutException: testFenceMultipleBrokers() timed out > after 40 secondsStacktracejava.util.concurrent.TimeoutException: > testFenceMultipleBrokers() timed out after 40 secondsat > org.junit.jupiter.engine.extension.TimeoutExceptionFactory.create(TimeoutExceptionFactory.java:29) >at > org.junit.jupiter.engine.extension.SameThreadTimeoutInvocation.proceed(SameThreadTimeoutInvocation.java:58) > at > org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:156) > at > org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:147) >at > org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:86) > at > org.junit.jupiter.engine.execution.InterceptingExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(InterceptingExecutableInvoker.java:103) > at > org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.lambda$invoke$0(InterceptingExecutableInvoker.java:93) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64) > at > 
org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45) > at > org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37) > at > org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:92) >at > org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:86) >at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$7(TestMethodTestDescriptor.java:218) > at > org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:214) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:139) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:69) >at > org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:151) >at > org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73) > at > org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141) >at > org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137) > at > org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139) >at > org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73) > at > org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138) > at > org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95) > at java.base/java.util.ArrayList.forEach(ArrayList.java:1511) at > 
org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:41) > at > org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:155) >at > org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73) > at > org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141) >at > org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
[jira] [Commented] (KAFKA-16283) RoundRobinPartitioner will only send to half of the partitions in a topic
[ https://issues.apache.org/jira/browse/KAFKA-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819325#comment-17819325 ] Justine Olshan commented on KAFKA-16283: also https://issues.apache.org/jira/browse/KAFKA-14573 > RoundRobinPartitioner will only send to half of the partitions in a topic > - > > Key: KAFKA-16283 > URL: https://issues.apache.org/jira/browse/KAFKA-16283 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.1.0, 3.0.0, 3.6.1 >Reporter: Luke Chen >Priority: Major > > When using `org.apache.kafka.clients.producer.RoundRobinPartitioner`, we > expect data are sent to all partitions in round-robin manner. But we found > there are only half of the partitions got the data. This causes half of the > resources(storage, consumer...) are wasted. > {code:java} > > bin/kafka-topics.sh --create --topic quickstart-events4 --bootstrap-server > > localhost:9092 --partitions 2 > Created topic quickstart-events4. > # send 1000 records to the topic, expecting 500 records in partition0, and > 500 records in partition1 > > bin/kafka-producer-perf-test.sh --topic quickstart-events4 --num-records > > 1000 --record-size 1024 --throughput -1 --producer-props > > bootstrap.servers=localhost:9092 > > partitioner.class=org.apache.kafka.clients.producer.RoundRobinPartitioner > 1000 records sent, 6535.947712 records/sec (6.38 MB/sec), 2.88 ms avg > latency, 121.00 ms max latency, 2 ms 50th, 7 ms 95th, 10 ms 99th, 121 ms > 99.9th. > > ls -al /tmp/kafka-logs/quickstart-events4-1 > total 24 > drwxr-xr-x 7 lukchen wheel 224 2 20 19:53 . > drwxr-xr-x 70 lukchen wheel 2240 2 20 19:53 .. 
> -rw-r--r-- 1 lukchen wheel 10485760 2 20 19:53 .index > -rw-r--r-- 1 lukchen wheel 1037819 2 20 19:53 .log > -rw-r--r-- 1 lukchen wheel 10485756 2 20 19:53 > .timeindex > -rw-r--r-- 1 lukchen wheel 8 2 20 19:53 leader-epoch-checkpoint > -rw-r--r-- 1 lukchen wheel43 2 20 19:53 partition.metadata > # No records in partition 1 > > ls -al /tmp/kafka-logs/quickstart-events4-0 > total 8 > drwxr-xr-x 7 lukchen wheel 224 2 20 19:53 . > drwxr-xr-x 70 lukchen wheel 2240 2 20 19:53 .. > -rw-r--r-- 1 lukchen wheel 10485760 2 20 19:53 .index > -rw-r--r-- 1 lukchen wheel 0 2 20 19:53 .log > -rw-r--r-- 1 lukchen wheel 10485756 2 20 19:53 > .timeindex > -rw-r--r-- 1 lukchen wheel 0 2 20 19:53 leader-epoch-checkpoint > -rw-r--r-- 1 lukchen wheel43 2 20 19:53 partition.metadata > {code} > Tested in kafka 3.0.0, 3.2.3, and the latest trunk, they all have the same > issue. It should already exist for a long time. > > Had a quick look, it's because we will abortOnNewBatch each time when new > batch created. -- This message was sent by Atlassian Jira (v8.20.10#820010)
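The abortOnNewBatch interaction described in the report can be simulated without a broker. This is a simplified model, assuming every record triggers a new batch (the worst case): RoundRobinPartitioner advances a per-topic counter on each partition() call, and because the producer calls partition() again after aborting the new batch, the counter advances twice per record, skipping half the partitions.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified simulation of the double-increment behavior; not the actual
// producer code path, just the counter arithmetic it implies.
public class RoundRobinSkipDemo {
    static Map<Integer, Integer> simulate(int numPartitions, int records) {
        int counter = 0; // models RoundRobinPartitioner's per-topic counter
        Map<Integer, Integer> recordsPerPartition = new HashMap<>();
        for (int i = 0; i < records; i++) {
            int first = counter++ % numPartitions; // initial partition() call
            // The new batch for `first` is aborted (abortOnNewBatch) and the
            // producer calls partition() again, advancing the counter a second time.
            int actual = counter++ % numPartitions;
            recordsPerPartition.merge(actual, 1, Integer::sum);
        }
        return recordsPerPartition;
    }

    public static void main(String[] args) {
        // With 2 partitions, every record lands on partition 1 and partition 0
        // stays empty, matching the empty log directory in the report.
        System.out.println(simulate(2, 1000)); // {1=1000}
    }
}
```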
[jira] [Commented] (KAFKA-16283) RoundRobinPartitioner will only send to half of the partitions in a topic
[ https://issues.apache.org/jira/browse/KAFKA-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819324#comment-17819324 ] Justine Olshan commented on KAFKA-16283: https://issues.apache.org/jira/browse/KAFKA-13359 Is another dupe? > RoundRobinPartitioner will only send to half of the partitions in a topic > - > > Key: KAFKA-16283 > URL: https://issues.apache.org/jira/browse/KAFKA-16283 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.1.0, 3.0.0, 3.6.1 >Reporter: Luke Chen >Priority: Major > > When using `org.apache.kafka.clients.producer.RoundRobinPartitioner`, we > expect data are sent to all partitions in round-robin manner. But we found > there are only half of the partitions got the data. This causes half of the > resources(storage, consumer...) are wasted. > {code:java} > > bin/kafka-topics.sh --create --topic quickstart-events4 --bootstrap-server > > localhost:9092 --partitions 2 > Created topic quickstart-events4. > # send 1000 records to the topic, expecting 500 records in partition0, and > 500 records in partition1 > > bin/kafka-producer-perf-test.sh --topic quickstart-events4 --num-records > > 1000 --record-size 1024 --throughput -1 --producer-props > > bootstrap.servers=localhost:9092 > > partitioner.class=org.apache.kafka.clients.producer.RoundRobinPartitioner > 1000 records sent, 6535.947712 records/sec (6.38 MB/sec), 2.88 ms avg > latency, 121.00 ms max latency, 2 ms 50th, 7 ms 95th, 10 ms 99th, 121 ms > 99.9th. > > ls -al /tmp/kafka-logs/quickstart-events4-1 > total 24 > drwxr-xr-x 7 lukchen wheel 224 2 20 19:53 . > drwxr-xr-x 70 lukchen wheel 2240 2 20 19:53 .. 
> -rw-r--r-- 1 lukchen wheel 10485760 2 20 19:53 .index > -rw-r--r-- 1 lukchen wheel 1037819 2 20 19:53 .log > -rw-r--r-- 1 lukchen wheel 10485756 2 20 19:53 > .timeindex > -rw-r--r-- 1 lukchen wheel 8 2 20 19:53 leader-epoch-checkpoint > -rw-r--r-- 1 lukchen wheel43 2 20 19:53 partition.metadata > # No records in partition 1 > > ls -al /tmp/kafka-logs/quickstart-events4-0 > total 8 > drwxr-xr-x 7 lukchen wheel 224 2 20 19:53 . > drwxr-xr-x 70 lukchen wheel 2240 2 20 19:53 .. > -rw-r--r-- 1 lukchen wheel 10485760 2 20 19:53 .index > -rw-r--r-- 1 lukchen wheel 0 2 20 19:53 .log > -rw-r--r-- 1 lukchen wheel 10485756 2 20 19:53 > .timeindex > -rw-r--r-- 1 lukchen wheel 0 2 20 19:53 leader-epoch-checkpoint > -rw-r--r-- 1 lukchen wheel43 2 20 19:53 partition.metadata > {code} > Tested in kafka 3.0.0, 3.2.3, and the latest trunk, they all have the same > issue. It should already exist for a long time. > > Had a quick look, it's because we will abortOnNewBatch each time when new > batch created. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16212) Cache partitions by TopicIdPartition instead of TopicPartition
[ https://issues.apache.org/jira/browse/KAFKA-16212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819023#comment-17819023 ] Justine Olshan commented on KAFKA-16212: We use something similar in the fetch session cache when the topic ID is unknown. I would suspect this transition state will last for as long as we support ZK. > how would this look like that I need to be aware of during extending > ReplicaManager cache to be topicId aware Not sure I understand this question. > Cache partitions by TopicIdPartition instead of TopicPartition > -- > > Key: KAFKA-16212 > URL: https://issues.apache.org/jira/browse/KAFKA-16212 > Project: Kafka > Issue Type: Improvement >Affects Versions: 3.7.0 >Reporter: Gaurav Narula >Assignee: Omnia Ibrahim >Priority: Major > > From the discussion in [PR > 15263|https://github.com/apache/kafka/pull/15263#discussion_r1471075201], it > would be better to cache {{allPartitions}} by {{TopicIdPartition}} instead of > {{TopicPartition}} to avoid ambiguity. -- This message was sent by Atlassian Jira (v8.20.10#820010)
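The zero-topic-ID placeholder idea mentioned in the comments can be sketched as follows. This is a hypothetical illustration, not the ReplicaManager or fetch-session implementation: the cache is keyed by (topicId, topic, partition), and requests that carry no topic ID (older protocol versions, ZK-era state) fall back to a zero UUID that can never be a valid topic ID.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch of caching partitions by TopicIdPartition with a
// zero-UUID placeholder for ID-less lookups. Names are illustrative.
public class TopicIdPartitionCache {
    static final UUID ZERO_UUID = new UUID(0L, 0L); // placeholder, never a real topic ID

    record Key(UUID topicId, String topic, int partition) {}

    private final Map<Key, String> partitions = new HashMap<>();

    void put(UUID topicId, String topic, int partition, String state) {
        partitions.put(new Key(topicId == null ? ZERO_UUID : topicId, topic, partition), state);
    }

    String get(UUID topicId, String topic, int partition) {
        return partitions.get(new Key(topicId == null ? ZERO_UUID : topicId, topic, partition));
    }
}
```

One design consequence worth noting: entries stored under the placeholder cannot disambiguate a topic that was deleted and recreated, which is exactly the ambiguity the ticket wants real topic IDs to remove.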
[jira] [Commented] (KAFKA-16282) Allow to get last stable offset (LSO) in kafka-get-offsets.sh
[ https://issues.apache.org/jira/browse/KAFKA-16282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818903#comment-17818903 ] Justine Olshan commented on KAFKA-16282: I am also happy to review the KIP and PRs :) > Allow to get last stable offset (LSO) in kafka-get-offsets.sh > - > > Key: KAFKA-16282 > URL: https://issues.apache.org/jira/browse/KAFKA-16282 > Project: Kafka > Issue Type: Improvement >Reporter: Luke Chen >Assignee: Ahmed Sobeh >Priority: Major > Labels: need-kip, newbie, newbie++ > > Currently, when using `kafka-get-offsets.sh` to get the offset by time, we > have these choices: > {code:java} > --time / timestamp of the offsets before > that. > -1 or latest / [Note: No offset is returned, if > the > -2 or earliest / timestamp greater than recently > -3 or max-timestamp /committed record timestamp is > -4 or earliest-local / given.] (default: latest) > -5 or latest-tiered > {code} > For the "latest" option, it'll always return the "high watermark" because we > always send with the default option: *IsolationLevel.READ_UNCOMMITTED*. It > would be good if the command can support to get the last stable offset (LSO) > for transaction support. That is, sending the option with > *IsolationLevel.READ_COMMITTED* -- This message was sent by Atlassian Jira (v8.20.10#820010)
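The two isolation levels discussed above differ only in which offset counts as "latest". A minimal model, assuming a single open transaction and ignoring aborted markers: under READ_UNCOMMITTED the latest offset is the high watermark, while under READ_COMMITTED it is the last stable offset (LSO), i.e. the offset of the first still-open transaction when one exists. This illustrates the semantics only, not the kafka-get-offsets.sh implementation.

```java
// Simplified model of "latest" offset semantics per isolation level.
public class LatestOffsetModel {
    // highWatermark: first offset past the fully replicated records.
    // firstOpenTxnOffset: offset of the earliest record of an open
    // transaction, or -1 when no transaction is in progress.
    static long latestOffset(long highWatermark, long firstOpenTxnOffset, boolean readCommitted) {
        if (!readCommitted || firstOpenTxnOffset < 0) {
            return highWatermark;
        }
        return Math.min(firstOpenTxnOffset, highWatermark); // the LSO
    }

    public static void main(String[] args) {
        // Open transaction starts at offset 7, high watermark at 10:
        System.out.println(latestOffset(10, 7, false)); // 10 (READ_UNCOMMITTED)
        System.out.println(latestOffset(10, 7, true));  // 7  (READ_COMMITTED / LSO)
        System.out.println(latestOffset(10, -1, true)); // 10 (no open transaction)
    }
}
```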
[jira] [Commented] (KAFKA-16264) Expose `producer.id.expiration.check.interval.ms` as dynamic broker configuration
[ https://issues.apache.org/jira/browse/KAFKA-16264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818064#comment-17818064 ] Justine Olshan commented on KAFKA-16264: Hey [~jeqo] thanks for filing. I was also thinking about this after you filed the first ticket. One thing that is interesting is right now the expiration is scheduled on a separate thread on startup. I guess the best course of action is to cancel that scheduled task and create a new one? See UnifiedLog producerExpireCheck. > Expose `producer.id.expiration.check.interval.ms` as dynamic broker > configuration > - > > Key: KAFKA-16264 > URL: https://issues.apache.org/jira/browse/KAFKA-16264 > Project: Kafka > Issue Type: Improvement >Reporter: Jorge Esteban Quilcate Otoya >Assignee: Jorge Esteban Quilcate Otoya >Priority: Major > > Dealing with a scenario where too many producer ids lead to issues (e.g. high > cpu utilization, see KAFKA-16229) put operators in need to flush producer ids > more promptly than usual. > Currently, only the expiration timeout `producer.id.expiration.ms` is exposed > as dynamic config. This is helpful (e.g. by reducing the timeout, less > producer would eventually be kept in memory), but not enough if the > evaluation frequency is not sufficiently short to flush producer ids before > becoming an issue. Only by tuning both, the issue could be workaround. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
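The "cancel the scheduled task and create a new one" approach raised in the comment can be sketched with a plain ScheduledExecutorService. This is an assumption-laden illustration: Kafka's actual producerExpireCheck lives in UnifiedLog and uses Kafka's own scheduler, and the names below are hypothetical.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch of rescheduling a periodic expiration check when the check
// interval is updated dynamically. Not the Kafka implementation.
public class RescheduleDemo {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> expireCheck;

    // Called at startup and again whenever the interval config changes.
    synchronized ScheduledFuture<?> schedule(Runnable check, long intervalMs) {
        if (expireCheck != null) {
            expireCheck.cancel(false); // let an in-flight expiration pass finish
        }
        expireCheck = scheduler.scheduleAtFixedRate(check, intervalMs, intervalMs, TimeUnit.MILLISECONDS);
        return expireCheck;
    }

    void shutdown() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) {
        RescheduleDemo demo = new RescheduleDemo();
        ScheduledFuture<?> initial = demo.schedule(() -> { /* expire producer ids */ }, 600_000);
        // Dynamic update of producer.id.expiration.check.interval.ms arrives:
        ScheduledFuture<?> updated = demo.schedule(() -> { /* expire producer ids */ }, 60_000);
        System.out.println(initial.isCancelled() + " " + updated.isCancelled()); // true false
        demo.shutdown();
    }
}
```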
[jira] [Commented] (KAFKA-16212) Cache partitions by TopicIdPartition instead of TopicPartition
[ https://issues.apache.org/jira/browse/KAFKA-16212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817795#comment-17817795 ] Justine Olshan commented on KAFKA-16212: Thanks [~omnia_h_ibrahim] I remember this particularly being a tricky area for topic IDs, so I appreciate the time being spent here. For fetch requests, I remember we had to do something similar to proposal 1 for the fetch path. When we receive fetch request < 13, we will have a zero topic ID stored. (Note – this is a placeholder and can not be used as a valid topic ID) Does using the zero uuid make some of option 1 easier? Or are we already using the zero uuid to signify something? > Cache partitions by TopicIdPartition instead of TopicPartition > -- > > Key: KAFKA-16212 > URL: https://issues.apache.org/jira/browse/KAFKA-16212 > Project: Kafka > Issue Type: Improvement >Affects Versions: 3.7.0 >Reporter: Gaurav Narula >Assignee: Omnia Ibrahim >Priority: Major > > From the discussion in [PR > 15263|https://github.com/apache/kafka/pull/15263#discussion_r1471075201], it > would be better to cache {{allPartitions}} by {{TopicIdPartition}} instead of > {{TopicPartition}} to avoid ambiguity. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16245) DescribeConsumerGroupTest failing
Justine Olshan created KAFKA-16245: -- Summary: DescribeConsumerGroupTest failing Key: KAFKA-16245 URL: https://issues.apache.org/jira/browse/KAFKA-16245 Project: Kafka Issue Type: Task Reporter: Justine Olshan The first instances on trunk are in this PR [https://github.com/apache/kafka/pull/15275] And this PR seems to have it failing consistently in the builds when it wasn't failing this consistently before. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-16229) Slow expiration of Producer IDs leading to high CPU usage
[ https://issues.apache.org/jira/browse/KAFKA-16229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-16229. Resolution: Fixed > Slow expiration of Producer IDs leading to high CPU usage > - > > Key: KAFKA-16229 > URL: https://issues.apache.org/jira/browse/KAFKA-16229 > Project: Kafka > Issue Type: Bug >Reporter: Jorge Esteban Quilcate Otoya >Assignee: Jorge Esteban Quilcate Otoya >Priority: Major > > Expiration of ProducerIds is implemented with a slow removal of map keys: > ``` > producers.keySet().removeAll(keys); > ``` > Unnecessarily going through all producer ids and then throw all expired keys > to be removed. > This leads to exponential time on worst case when most/all keys need to be > removed: > ``` > Benchmark (numProducerIds) Mode Cnt > Score Error Units > ProducerStateManagerBench.testDeleteExpiringIds 100 avgt 3 > 9164.043 ± 10647.877 ns/op > ProducerStateManagerBench.testDeleteExpiringIds 1000 avgt 3 > 341561.093 ± 20283.211 ns/op > ProducerStateManagerBench.testDeleteExpiringIds 1 avgt 3 > 44957983.550 ± 9389011.290 ns/op > ProducerStateManagerBench.testDeleteExpiringIds 10 avgt 3 > 5683374164.167 ± 1446242131.466 ns/op > ``` > A simple fix is to use map#remove(key) instead, leading to a more linear > growth: > ``` > Benchmark (numProducerIds) Mode Cnt > Score Error Units > ProducerStateManagerBench.testDeleteExpiringIds 100 avgt 3 > 5779.056 ± 651.389 ns/op > ProducerStateManagerBench.testDeleteExpiringIds 1000 avgt 3 > 61430.530 ± 21875.644 ns/op > ProducerStateManagerBench.testDeleteExpiringIds 1 avgt 3 > 643887.031 ± 600475.302 ns/op > ProducerStateManagerBench.testDeleteExpiringIds 10 avgt 3 > 7741689.539 ± 3218317.079 ns/op > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010)
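The cost difference behind the fix above can be reproduced with plain collections. In the worst case described, `keySet().removeAll(expired)` degrades to O(n * m): when the argument collection is not smaller than the key set, the JDK iterates the key set and calls `expired.contains(key)` for each entry, a linear scan when `expired` is a List. Removing keys one by one keeps each removal O(1) on a HashMap. Both leave the map in the same state; only the cost differs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Demonstrates the two removal strategies from KAFKA-16229 on a toy
// producer-state map; IDs and states are synthetic.
public class ExpireDemo {
    static Map<Long, String> buildProducers(int n) {
        Map<Long, String> producers = new HashMap<>();
        for (long id = 0; id < n; id++) {
            producers.put(id, "state-" + id);
        }
        return producers;
    }

    public static void main(String[] args) {
        int n = 10_000;
        List<Long> expired = new ArrayList<>();
        for (long id = 0; id < n; id++) {
            expired.add(id); // worst case: every producer id has expired
        }

        Map<Long, String> slow = buildProducers(n);
        // O(n * m) here: iterates the key set, scanning the List per key.
        slow.keySet().removeAll(expired);

        Map<Long, String> fast = buildProducers(n);
        // The fix: one O(1) HashMap removal per expired key, O(m) overall.
        for (Long id : expired) {
            fast.remove(id);
        }

        System.out.println(slow.size() + " " + fast.size()); // 0 0
    }
}
```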
[jira] [Commented] (KAFKA-16225) Flaky test suite LogDirFailureTest
[ https://issues.apache.org/jira/browse/KAFKA-16225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816260#comment-17816260 ] Justine Olshan commented on KAFKA-16225: Looks like it happened on 3.7 the same day which narrows it down to 3 commits? https://ge.apache.org/scans/tests?search.names=Git%20branch=P28D=kafka=America%2FLos_Angeles=3.7=kafka.server.LogDirFailureTest > Flaky test suite LogDirFailureTest > -- > > Key: KAFKA-16225 > URL: https://issues.apache.org/jira/browse/KAFKA-16225 > Project: Kafka > Issue Type: Bug > Components: core, unit tests >Reporter: Greg Harris >Priority: Major > Labels: flaky-test > > I see this failure on trunk and in PR builds for multiple methods in this > test suite: > {noformat} > org.opentest4j.AssertionFailedError: expected: but was: > at > org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151) > > at > org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132) > > at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63) > at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) > at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31) > at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:179) > at kafka.utils.TestUtils$.causeLogDirFailure(TestUtils.scala:1715) > at > kafka.server.LogDirFailureTest.testProduceAfterLogDirFailureOnLeader(LogDirFailureTest.scala:186) > > at > kafka.server.LogDirFailureTest.testIOExceptionDuringLogRoll(LogDirFailureTest.scala:70){noformat} > It appears this assertion is failing > [https://github.com/apache/kafka/blob/f54975c33135140351c50370282e86c49c81bbdd/core/src/test/scala/unit/kafka/utils/TestUtils.scala#L1715] > The other error which is appearing is this: > {noformat} > org.opentest4j.AssertionFailedError: Unexpected exception type thrown, > expected: but was: > > at > org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151) > > at 
org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:67) > at org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:35) > at org.junit.jupiter.api.Assertions.assertThrows(Assertions.java:3111) > at > kafka.server.LogDirFailureTest.testProduceErrorsFromLogDirFailureOnLeader(LogDirFailureTest.scala:164) > > at > kafka.server.LogDirFailureTest.testProduceErrorFromFailureOnLogRoll(LogDirFailureTest.scala:64){noformat} > Failures appear to have started in this commit, but this does not indicate > that this commit is at fault: > [https://github.com/apache/kafka/tree/3d95a69a28c2d16e96618cfa9a1eb69180fb66ea] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16225) Flaky test suite LogDirFailureTest
[ https://issues.apache.org/jira/browse/KAFKA-16225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816258#comment-17816258 ] Justine Olshan commented on KAFKA-16225: I'm encountering this too. It looks pretty bad https://ge.apache.org/scans/tests?search.names=Git%20branch=P28D=kafka=America%2FLos_Angeles=trunk=kafka.server.LogDirFailureTest > Flaky test suite LogDirFailureTest > -- > > Key: KAFKA-16225 > URL: https://issues.apache.org/jira/browse/KAFKA-16225 > Project: Kafka > Issue Type: Bug > Components: core, unit tests >Reporter: Greg Harris >Priority: Major > Labels: flaky-test > > I see this failure on trunk and in PR builds for multiple methods in this > test suite: > {noformat} > org.opentest4j.AssertionFailedError: expected: but was: > at > org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151) > > at > org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132) > > at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63) > at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36) > at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31) > at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:179) > at kafka.utils.TestUtils$.causeLogDirFailure(TestUtils.scala:1715) > at > kafka.server.LogDirFailureTest.testProduceAfterLogDirFailureOnLeader(LogDirFailureTest.scala:186) > > at > kafka.server.LogDirFailureTest.testIOExceptionDuringLogRoll(LogDirFailureTest.scala:70){noformat} > It appears this assertion is failing > [https://github.com/apache/kafka/blob/f54975c33135140351c50370282e86c49c81bbdd/core/src/test/scala/unit/kafka/utils/TestUtils.scala#L1715] > The other error which is appearing is this: > {noformat} > org.opentest4j.AssertionFailedError: Unexpected exception type thrown, > expected: but was: > > at > org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151) > > at 
org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:67) > at org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:35) > at org.junit.jupiter.api.Assertions.assertThrows(Assertions.java:3111) > at > kafka.server.LogDirFailureTest.testProduceErrorsFromLogDirFailureOnLeader(LogDirFailureTest.scala:164) > > at > kafka.server.LogDirFailureTest.testProduceErrorFromFailureOnLogRoll(LogDirFailureTest.scala:64){noformat} > Failures appear to have started in this commit, but this does not indicate > that this commit is at fault: > [https://github.com/apache/kafka/tree/3d95a69a28c2d16e96618cfa9a1eb69180fb66ea] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-14920) Address timeouts and out of order sequences
[ https://issues.apache.org/jira/browse/KAFKA-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan updated KAFKA-14920: --- Description: KAFKA-14844 showed the destructive nature of a timeout on the first produce request for a topic partition (i.e. one that has no state in psm). Since we currently don't validate the first sequence (we will in part 2 of KIP-890), any transient error on the first produce can lead to out of order sequences that never recover. Originally, KAFKA-14561 relied on the producer's retry mechanism for these transient issues, but until that is fixed, we may need to retry from within the AddPartitionsManager instead. We addressed the concurrent transactions case, but there are other errors, like coordinator loading, that we could run into and see increased out of order issues. was: KAFKA-14844 showed the destructive nature of a timeout on the first produce request for a topic partition (i.e. one that has no state in psm) > Address timeouts and out of order sequences > --- > > Key: KAFKA-14920 > URL: https://issues.apache.org/jira/browse/KAFKA-14920 > Project: Kafka > Issue Type: Sub-task >Affects Versions: 3.6.0 >Reporter: Justine Olshan >Assignee: Justine Olshan >Priority: Blocker > Fix For: 3.6.0 > > > KAFKA-14844 showed the destructive nature of a timeout on the first produce > request for a topic partition (i.e. one that has no state in psm) > Since we currently don't validate the first sequence (we will in part 2 of > KIP-890), any transient error on the first produce can lead to out of order > sequences that never recover. 
> Originally, KAFKA-14561 relied on the producer's retry mechanism for these > transient issues, but until that is fixed, we may need to retry from within the > AddPartitionsManager instead. We addressed the concurrent transactions case, but > there are other errors, like coordinator loading, that we could run into and > see increased out of order issues.
[jira] [Comment Edited] (KAFKA-16217) Transactional producer stuck in IllegalStateException during close
[ https://issues.apache.org/jira/browse/KAFKA-16217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813432#comment-17813432 ] Justine Olshan edited comment on KAFKA-16217 at 2/2/24 1:36 AM: Thanks [~calvinliu] for filing this bug. With [https://github.com/apache/kafka/pull/13591], if we are in the committing state and try to abort when closing, we catch the IllegalStateException. At this point, it is not possible to complete the transaction. Given that we can't move past this state, we should be able to close. The server will either abort the transaction due to timeout, or the new producer coming up with the same producer ID will abort the transaction. was (Author: jolshan): Thanks [~calvinliu] for filing this bug. With [https://github.com/apache/kafka/pull/13591] if we are in committing state and try to abort when closing, we will hit an IllegalStateException and enter fatal state. At this point, it is not possible to complete the transaction. Given that the fatal state means the producer must be closed, we should allow closing the producer in fatal state. The server will either abort the transaction due to timeout, or the new producer coming up with the same producer ID will abort the transaction. > Transactional producer stuck in IllegalStateException during close > -- > > Key: KAFKA-16217 > URL: https://issues.apache.org/jira/browse/KAFKA-16217 > Project: Kafka > Issue Type: Bug > Components: clients, producer >Affects Versions: 3.7.0, 3.6.1 >Reporter: Calvin Liu >Assignee: Kirk True >Priority: Major > Labels: transactions > Fix For: 3.6.2, 3.7.1 > > > The producer is stuck during the close. It keeps retrying to abort the > transaction but it never succeeds. 
> {code:java} > [ERROR] 2024-02-01 17:21:22,804 [kafka-producer-network-thread | > producer-transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] > org.apache.kafka.clients.producer.internals.Sender run - [Producer > clientId=producer-transaction-ben > ch-transaction-id-f60SGdyRQGGFjdgg3vUgKg, > transactionalId=transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] > Error in kafka producer I/O thread while aborting transaction: > java.lang.IllegalStateException: Cannot attempt operation `abortTransaction` > because the previous call to `commitTransaction` timed out and must be retried > at > org.apache.kafka.clients.producer.internals.TransactionManager.handleCachedTransactionRequestResult(TransactionManager.java:1138) > at > org.apache.kafka.clients.producer.internals.TransactionManager.beginAbort(TransactionManager.java:323) > at > org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:274) > at java.base/java.lang.Thread.run(Thread.java:1583) > at org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:66) > {code} > With the additional log, I found the root cause. If the producer is in a bad > transaction state(in my case, the TransactionManager.pendingTransition was > set to commitTransaction and did not get cleaned), then the producer calls > close and tries to abort the existing transaction, the producer will get > stuck in the transaction abortion. It is related to the fix > [https://github.com/apache/kafka/pull/13591]. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
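The state machine described above can be modeled with a small toy; the class and method names below are illustrative, not the real TransactionManager API. A timed-out commit leaves a cached pending operation behind, beginAbort then throws, and the proposed fix is for close() to swallow that exception and proceed, relying on the server-side transaction timeout or producer fencing to abort:

```java
public class TxnCloseSketch {
    // Toy model of the client-side transaction state described in the ticket.
    enum Pending { NONE, COMMIT, ABORT }

    static class ToyTransactionManager {
        Pending pendingTransition = Pending.NONE;

        void commitTimesOut() {
            // A timed-out commitTransaction leaves the cached pending operation in place.
            pendingTransition = Pending.COMMIT;
        }

        void beginAbort() {
            if (pendingTransition == Pending.COMMIT) {
                throw new IllegalStateException(
                    "Cannot attempt operation `abortTransaction` because the previous " +
                    "call to `commitTransaction` timed out and must be retried");
            }
            pendingTransition = Pending.ABORT;
        }
    }

    // Before the fix: close() retries beginAbort() forever and never terminates.
    // After the fix: catch the IllegalStateException and let close() proceed,
    // leaving the server to abort via timeout or the next producer instance.
    static boolean closeWithFix(ToyTransactionManager tm) {
        try {
            tm.beginAbort();
            return true;
        } catch (IllegalStateException e) {
            return true; // cannot complete the transaction; close anyway
        }
    }

    public static void main(String[] args) {
        ToyTransactionManager tm = new ToyTransactionManager();
        tm.commitTimesOut();
        System.out.println("closed=" + closeWithFix(tm));
    }
}
```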
[jira] [Updated] (KAFKA-16217) Transactional producer stuck in IllegalStateException during close
[ https://issues.apache.org/jira/browse/KAFKA-16217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan updated KAFKA-16217: --- Affects Version/s: 3.6.1 3.7.0 > Transactional producer stuck in IllegalStateException during close > -- > > Key: KAFKA-16217 > URL: https://issues.apache.org/jira/browse/KAFKA-16217 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 3.7.0, 3.6.1 >Reporter: Calvin Liu >Priority: Major > > The producer is stuck during the close. It keeps retrying to abort the > transaction but it never succeeds. > {code:java} > [ERROR] 2024-02-01 17:21:22,804 [kafka-producer-network-thread | > producer-transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] > org.apache.kafka.clients.producer.internals.Sender run - [Producer > clientId=producer-transaction-ben > ch-transaction-id-f60SGdyRQGGFjdgg3vUgKg, > transactionalId=transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] > Error in kafka producer I/O thread while aborting transaction: > java.lang.IllegalStateException: Cannot attempt operation `abortTransaction` > because the previous call to `commitTransaction` timed out and must be retried > at > org.apache.kafka.clients.producer.internals.TransactionManager.handleCachedTransactionRequestResult(TransactionManager.java:1138) > at > org.apache.kafka.clients.producer.internals.TransactionManager.beginAbort(TransactionManager.java:323) > at > org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:274) > at java.base/java.lang.Thread.run(Thread.java:1583) > at org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:66) > {code} > With the additional log, I found the root cause. If the producer is in a bad > transaction state(in my case, the TransactionManager.pendingTransition was > set to commitTransaction and did not get cleaned), then the producer calls > close and tries to abort the existing transaction, the producer will get > stuck in the transaction abortion. 
It is related to the fix > [https://github.com/apache/kafka/pull/13591]. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16217) Transactional producer stuck in IllegalStateException during close
[ https://issues.apache.org/jira/browse/KAFKA-16217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813432#comment-17813432 ] Justine Olshan commented on KAFKA-16217: Thanks [~calvinliu] for filing this bug. With [https://github.com/apache/kafka/pull/13591] if we are in committing state and try to abort when closing, we will hit an IllegalStateException and enter fatal state. At this point, it is not possible to complete the transaction. Given that the fatal state means the producer must be closed, we should allow closing the producer in fatal state. The server will either abort the transaction due to timeout, or the new producer coming up with the same producer ID will abort the transaction. > Transactional producer stuck in IllegalStateException during close > -- > > Key: KAFKA-16217 > URL: https://issues.apache.org/jira/browse/KAFKA-16217 > Project: Kafka > Issue Type: Bug > Components: clients >Reporter: Calvin Liu >Priority: Major > > The producer is stuck during the close. It keeps retrying to abort the > transaction but it never succeeds. 
> {code:java} > [ERROR] 2024-02-01 17:21:22,804 [kafka-producer-network-thread | > producer-transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] > org.apache.kafka.clients.producer.internals.Sender run - [Producer > clientId=producer-transaction-ben > ch-transaction-id-f60SGdyRQGGFjdgg3vUgKg, > transactionalId=transaction-bench-transaction-id-f60SGdyRQGGFjdgg3vUgKg] > Error in kafka producer I/O thread while aborting transaction: > java.lang.IllegalStateException: Cannot attempt operation `abortTransaction` > because the previous call to `commitTransaction` timed out and must be retried > at > org.apache.kafka.clients.producer.internals.TransactionManager.handleCachedTransactionRequestResult(TransactionManager.java:1138) > at > org.apache.kafka.clients.producer.internals.TransactionManager.beginAbort(TransactionManager.java:323) > at > org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:274) > at java.base/java.lang.Thread.run(Thread.java:1583) > at org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:66) > {code} > With the additional log, I found the root cause. If the producer is in a bad > transaction state(in my case, the TransactionManager.pendingTransition was > set to commitTransaction and did not get cleaned), then the producer calls > close and tries to abort the existing transaction, the producer will get > stuck in the transaction abortion. It is related to the fix > [https://github.com/apache/kafka/pull/13591]. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16212) Cache partitions by TopicIdPartition instead of TopicPartition
[ https://issues.apache.org/jira/browse/KAFKA-16212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17812822#comment-17812822 ] Justine Olshan commented on KAFKA-16212: When working on KIP-516 (topic IDs) there were many areas that topic IDs could help but changing them all in one go would have been difficult. I'm happy to see more steps in this direction :) > Cache partitions by TopicIdPartition instead of TopicPartition > -- > > Key: KAFKA-16212 > URL: https://issues.apache.org/jira/browse/KAFKA-16212 > Project: Kafka > Issue Type: Improvement >Affects Versions: 3.7.0 >Reporter: Gaurav Narula >Assignee: Omnia Ibrahim >Priority: Major > > From the discussion in [PR > 15263|https://github.com/apache/kafka/pull/15263#discussion_r1471075201], it > would be better to cache {{allPartitions}} by {{TopicIdPartition}} instead of > {{TopicPartition}} to avoid ambiguity. -- This message was sent by Atlassian Jira (v8.20.10#820010)
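A toy illustration of the ambiguity the ticket is about; the record types below are simplified stand-ins, not Kafka's actual classes. Keying a cache by topic name lets a deleted-and-recreated topic silently hit the stale entry, while keying by topic ID distinguishes the two incarnations:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class TopicIdCacheSketch {
    // Simplified stand-ins for Kafka's TopicPartition / TopicIdPartition.
    record TopicPartition(String topic, int partition) {}
    record TopicIdPartition(UUID topicId, String topic, int partition) {}

    public static void main(String[] args) {
        // Keyed by name: a deleted-and-recreated topic reuses the old entry.
        Map<TopicPartition, String> byName = new HashMap<>();
        byName.put(new TopicPartition("input", 0), "state-for-old-topic");
        // topic "input" is deleted and recreated: same name, different identity
        String stale = byName.get(new TopicPartition("input", 0));
        System.out.println("byName lookup returns stale entry: " + stale);

        // Keyed by topic ID: the recreated topic gets a fresh UUID, no collision.
        Map<TopicIdPartition, String> byId = new HashMap<>();
        UUID oldId = UUID.randomUUID();
        UUID newId = UUID.randomUUID();
        byId.put(new TopicIdPartition(oldId, "input", 0), "state-for-old-topic");
        String fresh = byId.get(new TopicIdPartition(newId, "input", 0));
        System.out.println("byId lookup for recreated topic: " + fresh);
    }
}
```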
[jira] [Commented] (KAFKA-16204) Stray file core/00000000000000000001.snapshot created when running core tests
[ https://issues.apache.org/jira/browse/KAFKA-16204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17811983#comment-17811983 ] Justine Olshan commented on KAFKA-16204: I've seen it when running just build + checkstyle + spotbugs. I didn't know if it was just me. > Stray file core/00000000000000000001.snapshot created when running core tests > - > > Key: KAFKA-16204 > URL: https://issues.apache.org/jira/browse/KAFKA-16204 > Project: Kafka > Issue Type: Improvement > Components: core, unit tests >Reporter: Mickael Maison >Priority: Major > > When running the core tests I often get a file called > core/00000000000000000001.snapshot created in my kafka folder. It looks like > one of the tests does not clean its resources properly.
[jira] [Resolved] (KAFKA-16122) TransactionsBounceTest -- server disconnected before response was received
[ https://issues.apache.org/jira/browse/KAFKA-16122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-16122. Resolution: Fixed > TransactionsBounceTest -- server disconnected before response was received > -- > > Key: KAFKA-16122 > URL: https://issues.apache.org/jira/browse/KAFKA-16122 > Project: Kafka > Issue Type: Test >Reporter: Justine Olshan >Assignee: Justine Olshan >Priority: Major > > I noticed a ton of tests failing with > h4. > {code:java} > Error org.apache.kafka.common.KafkaException: Unexpected error in > TxnOffsetCommitResponse: The server disconnected before a response was > received. {code} > {code:java} > Stacktrace org.apache.kafka.common.KafkaException: Unexpected error in > TxnOffsetCommitResponse: The server disconnected before a response was > received. at > app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnOffsetCommitHandler.handleResponse(TransactionManager.java:1702) > at > app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1236) > at > app//org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:154) > at > app//org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:608) > at app//org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:600) > at > app//org.apache.kafka.clients.producer.internals.Sender.maybeSendAndPollTransactionalRequest(Sender.java:457) > at > app//org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:334) > at > app//org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:249) > at java.base@21.0.1/java.lang.Thread.run(Thread.java:1583) {code} > The error indicates a network error which is retriable but the > TxnOffsetCommit handler doesn't expect this. > https://issues.apache.org/jira/browse/KAFKA-14417 addressed many of the other > requests but not this one. 
[jira] [Resolved] (KAFKA-15987) Refactor ReplicaManager code for transaction verification
[ https://issues.apache.org/jira/browse/KAFKA-15987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-15987. Resolution: Fixed > Refactor ReplicaManager code for transaction verification > - > > Key: KAFKA-15987 > URL: https://issues.apache.org/jira/browse/KAFKA-15987 > Project: Kafka > Issue Type: Sub-task >Reporter: Justine Olshan >Assignee: Justine Olshan >Priority: Major > > I started to do this in KAFKA-15784, but the diff was deemed too large and > confusing. I just wanted to file a followup ticket to reference this in code > for the areas that will be refactored. > > I hope to tackle it immediately after. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (KAFKA-16192) Introduce usage of flexible records to coordinators
[ https://issues.apache.org/jira/browse/KAFKA-16192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan updated KAFKA-16192: --- Description: [KIP-915|https://cwiki.apache.org/confluence/display/KAFKA/KIP-915%3A+Txn+and+Group+Coordinator+Downgrade+Foundation] introduced flexible versions to the records used for the group and transaction coordinators. However, the KIP did not update the record version used. For [KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense] we intend to use flexible fields in the transaction state records. This requires a safe way to upgrade from non-flexible version records to flexible version records. This change is implemented via MV or through a new feature type. was: [KIP-915|https://cwiki.apache.org/confluence/display/KAFKA/KIP-915%3A+Txn+and+Group+Coordinator+Downgrade+Foundation] introduced flexible versions to the records used for the group and transaction coordinators. However, the KIP did not update the record version used. For [KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense] we intend to use flexible fields in the transaction state records. This requires a safe way to upgrade from non-flexible version records to flexible version records. This change is implemented via MV. > Introduce usage of flexible records to coordinators > --- > > Key: KAFKA-16192 > URL: https://issues.apache.org/jira/browse/KAFKA-16192 > Project: Kafka > Issue Type: Task >Reporter: Justine Olshan >Assignee: Justine Olshan >Priority: Major > > [KIP-915|https://cwiki.apache.org/confluence/display/KAFKA/KIP-915%3A+Txn+and+Group+Coordinator+Downgrade+Foundation] > introduced flexible versions to the records used for the group and > transaction coordinators. > However, the KIP did not update the record version used. 
> For > [KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense] > we intend to use flexible fields in the transaction state records. This > requires a safe way to upgrade from non-flexible version records to flexible > version records. > This change is implemented via MV or through a new feature type. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-16122) TransactionsBounceTest -- server disconnected before response was received
[ https://issues.apache.org/jira/browse/KAFKA-16122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan reassigned KAFKA-16122: -- Assignee: Justine Olshan > TransactionsBounceTest -- server disconnected before response was received > -- > > Key: KAFKA-16122 > URL: https://issues.apache.org/jira/browse/KAFKA-16122 > Project: Kafka > Issue Type: Test >Reporter: Justine Olshan >Assignee: Justine Olshan >Priority: Major > > I noticed a ton of tests failing with > h4. > {code:java} > Error org.apache.kafka.common.KafkaException: Unexpected error in > TxnOffsetCommitResponse: The server disconnected before a response was > received. {code} > {code:java} > Stacktrace org.apache.kafka.common.KafkaException: Unexpected error in > TxnOffsetCommitResponse: The server disconnected before a response was > received. at > app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnOffsetCommitHandler.handleResponse(TransactionManager.java:1702) > at > app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1236) > at > app//org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:154) > at > app//org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:608) > at app//org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:600) > at > app//org.apache.kafka.clients.producer.internals.Sender.maybeSendAndPollTransactionalRequest(Sender.java:457) > at > app//org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:334) > at > app//org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:249) > at java.base@21.0.1/java.lang.Thread.run(Thread.java:1583) {code} > The error indicates a network error which is retriable but the > TxnOffsetCommit handler doesn't expect this. > https://issues.apache.org/jira/browse/KAFKA-14417 addressed many of the other > requests but not this one. 
[jira] [Updated] (KAFKA-16192) Introduce usage of flexible records to coordinators
[ https://issues.apache.org/jira/browse/KAFKA-16192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan updated KAFKA-16192: --- Description: [KIP-915|https://cwiki.apache.org/confluence/display/KAFKA/KIP-915%3A+Txn+and+Group+Coordinator+Downgrade+Foundation] introduced flexible versions to the records used for the group and transaction coordinators. However, the KIP did not update the record version used. For [KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense] we intend to use flexible fields in the transaction state records. This requires a safe way to upgrade from non-flexible version records to flexible version records. This change is implemented via MV. was: [KIP-915|https://cwiki.apache.org/confluence/display/KAFKA/KIP-915%3A+Txn+and+Group+Coordinator+Downgrade+Foundation] introduced flexible versions to the records used for the group and transaction coordinators. However, the KIP did not update the record version used. For [KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense] we intend to use flexible fields in the transaction state records. This requires a safe way to upgrade from non-flexible version records to flexible version records. Typically this is done as a message format bump. There may be an option to make this change using MV, since the readers of the records are internal and not external consumers. > Introduce usage of flexible records to coordinators > --- > > Key: KAFKA-16192 > URL: https://issues.apache.org/jira/browse/KAFKA-16192 > Project: Kafka > Issue Type: Task >Reporter: Justine Olshan >Assignee: Justine Olshan >Priority: Major > > [KIP-915|https://cwiki.apache.org/confluence/display/KAFKA/KIP-915%3A+Txn+and+Group+Coordinator+Downgrade+Foundation] > introduced flexible versions to the records used for the group and > transaction coordinators. > However, the KIP did not update the record version used. 
> For > [KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense] > we intend to use flexible fields in the transaction state records. This > requires a safe way to upgrade from non-flexible version records to flexible > version records. > This change is implemented via MV. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16192) Introduce usage of flexible records to coordinators
Justine Olshan created KAFKA-16192: -- Summary: Introduce usage of flexible records to coordinators Key: KAFKA-16192 URL: https://issues.apache.org/jira/browse/KAFKA-16192 Project: Kafka Issue Type: Task Reporter: Justine Olshan Assignee: Justine Olshan [KIP-915|https://cwiki.apache.org/confluence/display/KAFKA/KIP-915%3A+Txn+and+Group+Coordinator+Downgrade+Foundation] introduced flexible versions to the records used for the group and transaction coordinators. However, the KIP did not update the record version used. For [KIP-890|https://cwiki.apache.org/confluence/display/KAFKA/KIP-890%3A+Transactions+Server-Side+Defense] we intend to use flexible fields in the transaction state records. This requires a safe way to upgrade from non-flexible version records to flexible version records. Typically this is done as a message format bump. There may be an option to make this change using MV, since the readers of the records are internal and not external consumers.
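A minimal sketch of the MV-gating idea mentioned above; the version numbers and the gate level are hypothetical, not Kafka's actual values. The point is simply that the coordinator only writes the flexible record version once the cluster's metadata version (MV) guarantees every reader understands it, so a rolling upgrade never hands an old broker a record it cannot parse:

```java
public class FlexibleRecordGateSketch {
    // Hypothetical record versions: 0 = non-flexible, 1 = flexible (tagged fields allowed).
    static final short NON_FLEXIBLE = 0;
    static final short FLEXIBLE = 1;

    // Toy stand-in for a metadata-version feature gate: pick the record version
    // to write based on whether the whole cluster has reached the gating MV.
    static short recordVersionToWrite(int clusterMetadataVersion, int mvThatAddsFlexible) {
        return clusterMetadataVersion >= mvThatAddsFlexible ? FLEXIBLE : NON_FLEXIBLE;
    }

    public static void main(String[] args) {
        int mvGate = 17; // hypothetical MV level that enables flexible txn records
        System.out.println(recordVersionToWrite(16, mvGate)); // still mid-upgrade: non-flexible
        System.out.println(recordVersionToWrite(17, mvGate)); // safe to use tagged fields
    }
}
```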
[jira] [Commented] (KAFKA-15104) Flaky test MetadataQuorumCommandTest for method testDescribeQuorumReplicationSuccessful
[ https://issues.apache.org/jira/browse/KAFKA-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17809648#comment-17809648 ] Justine Olshan commented on KAFKA-15104: I saw this fail many times here: [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15183/3/tests] > Flaky test MetadataQuorumCommandTest for method > testDescribeQuorumReplicationSuccessful > --- > > Key: KAFKA-15104 > URL: https://issues.apache.org/jira/browse/KAFKA-15104 > Project: Kafka > Issue Type: Bug > Components: tools >Affects Versions: 3.5.0 >Reporter: Josep Prat >Priority: Major > Labels: flaky-test > > The MetadataQuorumCommandTest has become flaky on CI, I saw this failing: > org.apache.kafka.tools.MetadataQuorumCommandTest.[1] Type=Raft-Combined, > Name=testDescribeQuorumReplicationSuccessful, MetadataVersion=3.6-IV0, > Security=PLAINTEXT > Link to the CI: > https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-13865/2/testReport/junit/org.apache.kafka.tools/MetadataQuorumCommandTest/Build___JDK_8_and_Scala_2_121__Type_Raft_Combined__Name_testDescribeQuorumReplicationSuccessful__MetadataVersion_3_6_IV0__Security_PLAINTEXT/ > > h3. Error Message > {code:java} > java.util.concurrent.ExecutionException: java.lang.RuntimeException: Received > a fatal error while waiting for the controller to acknowledge that we are > caught up{code} > h3. 
Stacktrace > {code:java} > java.util.concurrent.ExecutionException: java.lang.RuntimeException: Received > a fatal error while waiting for the controller to acknowledge that we are > caught up at java.util.concurrent.FutureTask.report(FutureTask.java:122) at > java.util.concurrent.FutureTask.get(FutureTask.java:192) at > kafka.testkit.KafkaClusterTestKit.startup(KafkaClusterTestKit.java:419) at > kafka.test.junit.RaftClusterInvocationContext.lambda$getAdditionalExtensions$5(RaftClusterInvocationContext.java:115) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeBeforeTestExecutionCallbacks$5(TestMethodTestDescriptor.java:191) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeBeforeMethodsOrCallbacksUntilExceptionOccurs$6(TestMethodTestDescriptor.java:202) > at > org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeBeforeMethodsOrCallbacksUntilExceptionOccurs(TestMethodTestDescriptor.java:202) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeBeforeTestExecutionCallbacks(TestMethodTestDescriptor.java:190) > at > org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:136){code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16156) System test failing for new consumer on endOffsets with negative timestamps
[ https://issues.apache.org/jira/browse/KAFKA-16156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807858#comment-17807858 ] Justine Olshan commented on KAFKA-16156: The transactional copier is pretty old. We can probably update it. If I get time I can take a look. > System test failing for new consumer on endOffsets with negative timestamps > --- > > Key: KAFKA-16156 > URL: https://issues.apache.org/jira/browse/KAFKA-16156 > Project: Kafka > Issue Type: Sub-task > Components: clients >Reporter: Lianet Magrans >Priority: Major > > TransactionalMessageCopier run with 3.7 new consumer fails with "Invalid > negative timestamp". > Trace: > [2024-01-15 07:42:33,932] TRACE [Consumer > clientId=consumer-transactions-test-consumer-group-1, > groupId=transactions-test-consumer-group] Received ListOffsetResponse > ListOffsetsResponseData(throttleTimeMs=0, > topics=[ListOffsetsTopicResponse(name='input-topic', > partitions=[ListOffsetsPartitionResponse(partitionIndex=0, errorCode=0, > oldStyleOffsets=[], timestamp=-1, offset=42804, leaderEpoch=0)])]) from > broker worker2:9092 (id: 2 rack: null) > (org.apache.kafka.clients.consumer.internals.OffsetsRequestManager) > [2024-01-15 07:42:33,932] DEBUG [Consumer > clientId=consumer-transactions-test-consumer-group-1, > groupId=transactions-test-consumer-group] Handling ListOffsetResponse > response for input-topic-0. 
Fetched offset 42804, timestamp -1 > (org.apache.kafka.clients.consumer.internals.OffsetFetcherUtils) > [2024-01-15 07:42:33,932] TRACE [Consumer > clientId=consumer-transactions-test-consumer-group-1, > groupId=transactions-test-consumer-group] Updating last stable offset for > partition input-topic-0 to 42804 > (org.apache.kafka.clients.consumer.internals.OffsetFetcherUtils) > [2024-01-15 07:42:33,933] DEBUG [Consumer > clientId=consumer-transactions-test-consumer-group-1, > groupId=transactions-test-consumer-group] Fetch offsets completed > successfully for partitions and timestamps {input-topic-0=-1}. Result > org.apache.kafka.clients.consumer.internals.OffsetFetcherUtils$ListOffsetResult@bf2a862 > (org.apache.kafka.clients.consumer.internals.OffsetsRequestManager) > [2024-01-15 07:42:33,933] TRACE [Consumer > clientId=consumer-transactions-test-consumer-group-1, > groupId=transactions-test-consumer-group] No events to process > (org.apache.kafka.clients.consumer.internals.events.EventProcessor) > [2024-01-15 07:42:33,933] ERROR Shutting down after unexpected error in event > loop (org.apache.kafka.tools.TransactionalMessageCopier) > org.apache.kafka.common.KafkaException: java.lang.IllegalArgumentException: > Invalid negative timestamp > at > org.apache.kafka.clients.consumer.internals.ConsumerUtils.maybeWrapAsKafkaException(ConsumerUtils.java:234) > at > org.apache.kafka.clients.consumer.internals.ConsumerUtils.getResult(ConsumerUtils.java:212) > at > org.apache.kafka.clients.consumer.internals.events.CompletableApplicationEvent.get(CompletableApplicationEvent.java:44) > at > org.apache.kafka.clients.consumer.internals.events.ApplicationEventHandler.addAndGet(ApplicationEventHandler.java:113) > at > org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.beginningOrEndOffset(AsyncKafkaConsumer.java:1134) > at > org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.endOffsets(AsyncKafkaConsumer.java:1113) > at > 
org.apache.kafka.clients.consumer.internals.AsyncKafkaConsumer.endOffsets(AsyncKafkaConsumer.java:1108) > at > org.apache.kafka.clients.consumer.KafkaConsumer.endOffsets(KafkaConsumer.java:1651) > at > org.apache.kafka.tools.TransactionalMessageCopier.messagesRemaining(TransactionalMessageCopier.java:246) > at > org.apache.kafka.tools.TransactionalMessageCopier.runEventLoop(TransactionalMessageCopier.java:342) > at > org.apache.kafka.tools.TransactionalMessageCopier.main(TransactionalMessageCopier.java:292) > Caused by: java.lang.IllegalArgumentException: Invalid negative timestamp > at > org.apache.kafka.clients.consumer.OffsetAndTimestamp.(OffsetAndTimestamp.java:39) > at > org.apache.kafka.clients.consumer.internals.OffsetFetcherUtils.buildOffsetsForTimesResult(OffsetFetcherUtils.java:253) > at > org.apache.kafka.clients.consumer.internals.OffsetsRequestManager.lambda$fetchOffsets$1(OffsetsRequestManager.java:181) > at > java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602) > at > java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > at >
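The failure path above can be reduced to a tiny sketch; the method names are illustrative, not the real client internals. An end-offsets query legitimately comes back with the sentinel timestamp -1, a strict non-negative check (as in the OffsetAndTimestamp constructor in the trace) rejects it, while an offset-only result path would not trip over the sentinel at all:

```java
public class EndOffsetsTimestampSketch {
    // Mirrors the kind of check the copier trips: a ListOffsets response for an
    // end-offset query carries timestamp = -1, which a strict constructor rejects.
    static void requireValidTimestamp(long timestamp) {
        if (timestamp < 0) {
            throw new IllegalArgumentException("Invalid negative timestamp");
        }
    }

    // One plausible fix shape: for endOffsets()-style queries the timestamp is
    // meaningless, so return only the offset and never build a timestamp-bearing result.
    static long endOffsetFromResponse(long offset, long timestamp) {
        // intentionally ignore the sentinel timestamp (-1)
        return offset;
    }

    public static void main(String[] args) {
        long offset = 42804L, timestamp = -1L; // values from the trace above
        try {
            requireValidTimestamp(timestamp);
        } catch (IllegalArgumentException e) {
            System.out.println("strict path fails: " + e.getMessage());
        }
        System.out.println("offset-only path: " + endOffsetFromResponse(offset, timestamp));
    }
}
```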
[jira] [Created] (KAFKA-16122) TransactionsBounceTest -- server disconnected before response was received
Justine Olshan created KAFKA-16122: -- Summary: TransactionsBounceTest -- server disconnected before response was received Key: KAFKA-16122 URL: https://issues.apache.org/jira/browse/KAFKA-16122 Project: Kafka Issue Type: Test Reporter: Justine Olshan I noticed a ton of tests failing with h4. {code:java} Error org.apache.kafka.common.KafkaException: Unexpected error in TxnOffsetCommitResponse: The server disconnected before a response was received. {code} {code:java} Stacktrace org.apache.kafka.common.KafkaException: Unexpected error in TxnOffsetCommitResponse: The server disconnected before a response was received. at app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnOffsetCommitHandler.handleResponse(TransactionManager.java:1702) at app//org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1236) at app//org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:154) at app//org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:608) at app//org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:600) at app//org.apache.kafka.clients.producer.internals.Sender.maybeSendAndPollTransactionalRequest(Sender.java:457) at app//org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:334) at app//org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:249) at java.base@21.0.1/java.lang.Thread.run(Thread.java:1583) {code} The error indicates a network error which is retriable but the TxnOffsetCommit handler doesn't expect this. https://issues.apache.org/jira/browse/KAFKA-14417 addressed many of the other requests but not this one. -- This message was sent by Atlassian Jira (v8.20.10#820010)
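A toy sketch of the fix direction the ticket implies; the enum values and handler shape are illustrative, not the real TxnOffsetCommitHandler. The idea is that a retriable error such as NETWORK_EXCEPTION should re-enqueue the TxnOffsetCommit request rather than surface as an unexpected KafkaException:

```java
public class TxnOffsetCommitRetrySketch {
    // Illustrative error codes; only the retriable/fatal split matters here.
    enum TxnError { NONE, NETWORK_EXCEPTION, COORDINATOR_LOAD_IN_PROGRESS, INVALID_TXN_STATE }

    static boolean isRetriable(TxnError e) {
        return e == TxnError.NETWORK_EXCEPTION || e == TxnError.COORDINATOR_LOAD_IN_PROGRESS;
    }

    // Toy response handler: a retriable error re-enqueues the request instead of
    // throwing "Unexpected error in TxnOffsetCommitResponse".
    static String handleResponse(TxnError error) {
        if (error == TxnError.NONE) return "completed";
        if (isRetriable(error)) return "reenqueue";
        return "fatal";
    }

    public static void main(String[] args) {
        System.out.println(handleResponse(TxnError.NETWORK_EXCEPTION)); // retried, not fatal
        System.out.println(handleResponse(TxnError.INVALID_TXN_STATE));
    }
}
```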
[jira] [Commented] (KAFKA-16120) Partition reassignments in ZK migration dual write leaves stray partitions
[ https://issues.apache.org/jira/browse/KAFKA-16120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806129#comment-17806129 ]

Justine Olshan commented on KAFKA-16120:
----------------------------------------

Do you know if https://issues.apache.org/jira/browse/KAFKA-14616 is related?

> Partition reassignments in ZK migration dual write leaves stray partitions
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-16120
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16120
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: David Mao
>            Priority: Major
>
> When a reassignment is completed in ZK migration dual-write mode, the
> `StopReplica` sent by the kraft quorum migration propagator is sent with
> `delete = false` for deleted replicas when processing the topic delta. This
> results in stray replicas.


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Comment Edited] (KAFKA-16120) Partition reassignments in ZK migration dual write leaves stray partitions
[ https://issues.apache.org/jira/browse/KAFKA-16120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806129#comment-17806129 ]

Justine Olshan edited comment on KAFKA-16120 at 1/12/24 5:18 PM:
-----------------------------------------------------------------

Do you know if https://issues.apache.org/jira/browse/KAFKA-14616 is related?

Oh – I see this specifically for migration and I believe the other one is not.

was (Author: jolshan):
Do you know if https://issues.apache.org/jira/browse/KAFKA-14616 is related?

> Partition reassignments in ZK migration dual write leaves stray partitions
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-16120
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16120


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (KAFKA-16078) InterBrokerProtocolVersion defaults to non-production MetadataVersion
[ https://issues.apache.org/jira/browse/KAFKA-16078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17802308#comment-17802308 ]

Justine Olshan commented on KAFKA-16078:
----------------------------------------

We should really get to writing the KIP for the "non-production" vs "production" MVs.

> InterBrokerProtocolVersion defaults to non-production MetadataVersion
> ---------------------------------------------------------------------
>
>                 Key: KAFKA-16078
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16078
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: David Arthur
>            Assignee: David Arthur
>            Priority: Major


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17801029#comment-17801029 ]

Justine Olshan commented on KAFKA-16052:
----------------------------------------

Hey Luke – thanks for your help with the non-mocked replica manager. Did you mean to leave this comment about detecting leaking threads on a different jira? I think it is unrelated to the mock replica manager, right? I will take a look at your and Divij's PRs to fix the replica manager.

> OOM in Kafka test suite
> -----------------------
>
>                 Key: KAFKA-16052
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16052
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.7.0
>            Reporter: Divij Vaidya
>            Priority: Major
>         Attachments: Screenshot 2023-12-27 at 14.04.52.png, Screenshot 2023-12-27 at 14.22.21.png, Screenshot 2023-12-27 at 14.45.20.png, Screenshot 2023-12-27 at 15.31.09.png, Screenshot 2023-12-27 at 17.44.09.png, Screenshot 2023-12-28 at 00.13.06.png, Screenshot 2023-12-28 at 00.18.56.png, Screenshot 2023-12-28 at 11.26.03.png, Screenshot 2023-12-28 at 11.26.09.png, newRM.patch
>
> *Problem*
> Our test suite is failing with frequent OOM. Discussion in the mailing list is here: [https://lists.apache.org/thread/d5js0xpsrsvhgjb10mbzo9cwsy8087x4]
> *Setup*
> To find the source of leaks, I ran the :core:test build target with a single thread (see below on how to do it) and attached a profiler to it. This Jira tracks the list of action items identified from the analysis.
> How to run tests using a single thread:
> {code:java}
> diff --git a/build.gradle b/build.gradle
> index f7abbf4f0b..81df03f1ee 100644
> --- a/build.gradle
> +++ b/build.gradle
> @@ -74,9 +74,8 @@ ext {
>      "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED"
>    )
> -  maxTestForks = project.hasProperty('maxParallelForks') ? maxParallelForks.toInteger() : Runtime.runtime.availableProcessors()
> -  maxScalacThreads = project.hasProperty('maxScalacThreads') ? maxScalacThreads.toInteger() :
> -      Math.min(Runtime.runtime.availableProcessors(), 8)
> +  maxTestForks = 1
> +  maxScalacThreads = 1
>    userIgnoreFailures = project.hasProperty('ignoreFailures') ? ignoreFailures : false
>    userMaxTestRetries = project.hasProperty('maxTestRetries') ? maxTestRetries.toInteger() : 0
> diff --git a/gradle.properties b/gradle.properties
> index 4880248cac..ee4b6e3bc1 100644
> --- a/gradle.properties
> +++ b/gradle.properties
> @@ -30,4 +30,4 @@ scalaVersion=2.13.12
>  swaggerVersion=2.2.8
>  task=build
>  org.gradle.jvmargs=-Xmx2g -Xss4m -XX:+UseParallelGC
> -org.gradle.parallel=true
> +org.gradle.parallel=false
> {code}
> *Result of experiment*
> This is how the heap memory utilization looks, starting from tens of MB and ending with 1.5GB (with spikes of 2GB) of heap used as the test executes. Note that the total number of threads also increases, but it does not correlate with the sharp increase in heap memory usage. The heap dump is available at [https://www.dropbox.com/scl/fi/nwtgc6ir6830xlfy9z9cu/GradleWorkerMain_10311_27_12_2023_13_37_08.hprof.zip?rlkey=ozbdgh5vih4rcynnxbatzk7ln=0]
>
> !Screenshot 2023-12-27 at 14.22.21.png!


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800911#comment-17800911 ]

Justine Olshan commented on KAFKA-16052:
----------------------------------------

What do we think about lowering the number of sequences we test (say from 10 to 3 or 4) and the number of groups from 10 to 5 or so? That would lower the number of operations from 35,000 to 7,000. This is used in both the GroupCoordinator and the TransactionCoordinator test too. (The transaction coordinator doesn't have groups, but transactions.)

We could also look at some of the other tests in the suite. (To be honest, I never really understood how this suite was actually testing concurrency with all the mocks and redefined operations.)

We probably need to figure out why we store all of these mock invocations since we are just executing the methods anyway. But I think lowering the number of operations should give us some breathing room for the OOMs. What do you think [~divijvaidya] [~ijuma]?

> OOM in Kafka test suite
> -----------------------
>
>                 Key: KAFKA-16052
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16052


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
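The arithmetic behind the proposed reduction can be checked directly. The factor names below are inferred from the comments in this thread (10 random + 10 ordered sequences, 7 replicaManager operations per member per sequence, 250 members from nThreads(5) * 5 * 10 groups) and may not match the exact variable names in the test suite; the reduced figures assume 4 sequences of each kind and groups halved from 10 to 5:

```java
// Sanity check of the operation counts discussed above.
public class OperationCount {
    static int total(int sequences, int opsPerMember, int members) {
        return sequences * opsPerMember * members;
    }

    public static void main(String[] args) {
        int current = total(10 + 10, 7, 5 * 5 * 10);  // 10 random + 10 ordered sequences, 250 members
        int reduced = total(4 + 4, 7, 5 * 5 * 5);     // 4 of each sequence kind, groups 10 -> 5
        System.out.println(current);                   // 35000
        System.out.println(reduced);                   // 7000
    }
}
```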
[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800909#comment-17800909 ]

Justine Olshan edited comment on KAFKA-16052 at 12/28/23 2:35 AM:
------------------------------------------------------------------

So I realized that every test runs thousands of operations that call not only replicaManager.maybeStartTransactionVerificationForPartition but also other replicaManager methods like replicaManager.appendForGroup thousands of times. I'm trying to figure out why this is causing regressions now.

For the testConcurrentRandomSequence test, we call these methods approximately 35,000 times. We do 10 random sequences and 10 ordered sequences. Each sequence consists of 7 operations for the 250 group members (nThreads(5) * 5 * 10). So 20 * 7 * 250 is 35,000. Not sure if we need this many operations!

was (Author: jolshan):
So I realized that every test runs thousands of operations that call not only replicaManager.maybeStartTransactionVerificationForPartition but also other replicaManager methods like replicaManager.appendForGroup thousands of times. I'm trying to figure out why this is causing regressions now.

> OOM in Kafka test suite
> -----------------------
>
>                 Key: KAFKA-16052
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16052


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800909#comment-17800909 ]

Justine Olshan commented on KAFKA-16052:
----------------------------------------

So I realized that every test runs thousands of operations that call not only replicaManager.maybeStartTransactionVerificationForPartition but also other replicaManager methods like replicaManager.appendForGroup thousands of times. I'm trying to figure out why this is causing regressions now.

> OOM in Kafka test suite
> -----------------------
>
>                 Key: KAFKA-16052
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16052


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800903#comment-17800903 ]

Justine Olshan edited comment on KAFKA-16052 at 12/28/23 2:05 AM:
------------------------------------------------------------------

It looks like the method that is actually being called is replicaManager.maybeStartTransactionVerificationForPartition. This is overridden in TestReplicaManager to just do the callback. I'm not sure why it is storing all of these though. I will look into that next.

was (Author: jolshan):
It looks like the method that is actually being called is replicaManager.maybeStartTransactionVerificationForPartition there. This is overridden to just do the callback. I'm not sure why it is storing all of these though. I will look into that next.

> OOM in Kafka test suite
> -----------------------
>
>                 Key: KAFKA-16052
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16052


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800903#comment-17800903 ]

Justine Olshan commented on KAFKA-16052:
----------------------------------------

It looks like the method that is actually being called is replicaManager.maybeStartTransactionVerificationForPartition there. This is overridden to just do the callback. I'm not sure why it is storing all of these though. I will look into that next.

> OOM in Kafka test suite
> -----------------------
>
>                 Key: KAFKA-16052
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16052


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800902#comment-17800902 ]

Justine Olshan commented on KAFKA-16052:
----------------------------------------

Thanks Divij. This might be a few things then, since the handleTxnCommitOffsets is not in the TransactionCoordinator test. I will look into this though. I believe the OOMs started before I changed the transactional offset commit flow, but perhaps it is still related to something else I changed.

> OOM in Kafka test suite
> -----------------------
>
>                 Key: KAFKA-16052
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16052


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800880#comment-17800880 ]

Justine Olshan commented on KAFKA-16052:
----------------------------------------

There are only four tests in the group coordinator suite, so it is surprising to hit OOM on those. But trying the clearInlineMocks should be a low-hanging fruit that could help a bit while we figure out why the mocks are causing so many issues in the first place. I have a draft PR with the change and we can look at how the builds do: https://github.com/apache/kafka/pull/15081

> OOM in Kafka test suite
> -----------------------
>
>                 Key: KAFKA-16052
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16052


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
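Why clearing inline mocks helps: a mocking framework records every invocation on a mock for later verification, so a test that drives tens of thousands of calls through a mocked ReplicaManager holds heap proportional to the call count until the records are released (in Mockito that release is Mockito.framework().clearInlineMocks(), typically run in an @AfterEach). A toy illustration of the retention pattern — plain Java, not Mockito itself:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy stand-in for a mocking framework's invocation recorder: every call on
// the "mock" is kept alive until explicitly cleared, which is how a test
// driving ~35,000 operations through a mock retains ~35,000 records.
public class RecordingMock {
    private final List<String> recordedInvocations = new ArrayList<>();

    void invoke(String method, Object... args) {
        recordedInvocations.add(method + Arrays.toString(args));
        // ... dispatch to the stubbed behaviour here ...
    }

    int recordedCount() {
        return recordedInvocations.size();
    }

    // Analogous to the test-teardown clearInlineMocks() call
    void clearRecordings() {
        recordedInvocations.clear();
    }

    public static void main(String[] args) {
        RecordingMock mock = new RecordingMock();
        for (int i = 0; i < 35_000; i++) {
            mock.invoke("maybeStartTransactionVerificationForPartition", i);
        }
        System.out.println(mock.recordedCount());  // 35000 records held on the heap
        mock.clearRecordings();
        System.out.println(mock.recordedCount());  // 0 after the clear
    }
}
```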
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800877#comment-17800877 ]

Justine Olshan commented on KAFKA-16052:
----------------------------------------

Ah – good point [~ijuma]. I will look into that as well.

> OOM in Kafka test suite
> -----------------------
>
>                 Key: KAFKA-16052
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16052


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800865#comment-17800865 ] Justine Olshan commented on KAFKA-16052:

Taking a look at removing the mock as well. There are quite a few parts of the initialization that we don't need to do for the tests. For example:

val replicaFetcherManager = createReplicaFetcherManager(metrics, time, threadNamePrefix, quotaManagers.follower)
private[server] val replicaAlterLogDirsManager = createReplicaAlterLogDirsManager(quotaManagers.alterLogDirs, brokerTopicStats)

We can do these if necessary, but it also starts other threads that we probably don't want running for the test.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800855#comment-17800855 ] Justine Olshan edited comment on KAFKA-16052 at 12/27/23 7:20 PM:

Doing a quick scan of the InterceptedInvocations, I see many with the method numDelayedDelete, which is a method on DelayedOperations. I will look into that avenue for a bit.

was (Author: jolshan): Doing a quick scan of the InterceptedInvocations, I see many with the method numDelayedDelete, which is a method on DelayedOperations. I will look into that avenue for a bit. EDIT: It doesn't seem like we mock DelayedOperations in these tests, so I'm not sure if this is something else.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800855#comment-17800855 ] Justine Olshan edited comment on KAFKA-16052 at 12/27/23 7:19 PM:

Doing a quick scan of the InterceptedInvocations, I see many with the method numDelayedDelete, which is a method on DelayedOperations. I will look into that avenue for a bit. EDIT: It doesn't seem like we mock DelayedOperations in these tests, so I'm not sure if this is something else.

was (Author: jolshan): Doing a quick scan of the InterceptedInvocations, I see many with the method numDelayedDelete, which is a method on DelayedOperations. I will look into that avenue for a bit.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800855#comment-17800855 ] Justine Olshan commented on KAFKA-16052:

Doing a quick scan of the InterceptedInvocations, I see many with the method numDelayedDelete, which is a method on DelayedOperations. I will look into that avenue for a bit.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800852#comment-17800852 ] Justine Olshan commented on KAFKA-16052:

I can also take a look at the heap dump if that is helpful.

-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-16052) OOM in Kafka test suite
[ https://issues.apache.org/jira/browse/KAFKA-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800851#comment-17800851 ] Justine Olshan commented on KAFKA-16052:

Thanks Divij for the digging. These tests share the common AbstractCoordinatorConcurrencyTest. I wonder if the mock in question is there. We do mock the replica manager, but in an interesting way:

replicaManager = mock(classOf[TestReplicaManager], withSettings().defaultAnswer(CALLS_REAL_METHODS))

-- This message was sent by Atlassian Jira (v8.20.10#820010)
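The comment above points at a ReplicaManager mock built with withSettings().defaultAnswer(CALLS_REAL_METHODS). One way such a mock can dominate a heap dump is that mocking frameworks record every intercepted invocation for later verification, so a mock sitting on a hot path (such as numDelayedDelete) retains one record per call. The sketch below is a minimal stdlib model of that mechanism only; it is not Mockito's actual implementation, and the DelayedOperations interface and call counts are illustrative:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch: a proxy that records every invocation, modeling why a
// mock on a hot path can make heap usage grow with test length.
public class RecordingMockSketch {
    interface DelayedOperations {
        int numDelayedDelete();
    }

    // Each entry models one recorded invocation (Mockito's InterceptedInvocation).
    static final List<String> recordedInvocations = new ArrayList<>();

    static DelayedOperations recordingMock() {
        InvocationHandler handler = (proxy, method, args) -> {
            recordedInvocations.add(method.getName()); // retained, never released
            return 0; // default answer for int-returning methods
        };
        return (DelayedOperations) Proxy.newProxyInstance(
            DelayedOperations.class.getClassLoader(),
            new Class<?>[] { DelayedOperations.class },
            handler);
    }

    public static void main(String[] args) {
        DelayedOperations ops = recordingMock();
        for (int i = 0; i < 100_000; i++) {
            ops.numDelayedDelete(); // a hot-path call during a test
        }
        // One record per call: memory grows linearly with the number of calls.
        System.out.println(recordedInvocations.size());
    }
}
```

This is why either removing the mock or constructing a lightweight real object, as discussed in the comments, avoids the accumulation entirely.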
[jira] [Resolved] (KAFKA-16012) Incomplete range assignment in consumer
[ https://issues.apache.org/jira/browse/KAFKA-16012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-16012. Resolution: Fixed > Incomplete range assignment in consumer > --- > > Key: KAFKA-16012 > URL: https://issues.apache.org/jira/browse/KAFKA-16012 > Project: Kafka > Issue Type: Bug >Reporter: Jason Gustafson >Assignee: Philip Nee >Priority: Blocker > Fix For: 3.7.0 > > > We were looking into test failures here: > https://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/system-test-kafka-branch-builder--1702475525--jolshan--kafka-15784--7cad567675/2023-12-13--001./2023-12-13–001./report.html. > > Here is the first failure in the report: > {code:java} > > test_id: > kafkatest.tests.core.group_mode_transactions_test.GroupModeTransactionsTest.test_transactions.failure_mode=clean_bounce.bounce_target=brokers > status: FAIL > run time: 3 minutes 4.950 seconds > TimeoutError('Consumer consumed only 88223 out of 10 messages in > 90s') {code} > > We traced the failure to an apparent bug during the last rebalance before the > group became empty. The last remaining instance seems to receive an > incomplete assignment which prevents it from completing expected consumption > on some partitions. Here is the rebalance from the coordinator's perspective: > {code:java} > server.log.2023-12-13-04:[2023-12-13 04:58:56,987] INFO [GroupCoordinator 3]: > Stabilized group grouped-transactions-test-consumer-group generation 5 > (__consumer_offsets-2) with 1 members > (kafka.coordinator.group.GroupCoordinator) > server.log.2023-12-13-04:[2023-12-13 04:58:56,990] INFO [GroupCoordinator 3]: > Assignment received from leader > consumer-grouped-transactions-test-consumer-group-1-2164f472-93f3-4176-af3f-23d4ed8b37fd > for group grouped-transactions-test-consumer-group for generation 5. The > group has 1 members, 0 of which are static. 
> (kafka.coordinator.group.GroupCoordinator) {code} > The group is down to one member in generation 5. In the previous generation, > the consumer in question reported this assignment: > {code:java} > // Gen 4: we've got partitions 0-4 > [2023-12-13 04:58:52,631] DEBUG [Consumer > clientId=consumer-grouped-transactions-test-consumer-group-1, > groupId=grouped-transactions-test-consumer-group] Executing onJoinComplete > with generation 4 and memberId > consumer-grouped-transactions-test-consumer-group-1-2164f472-93f3-4176-af3f-23d4ed8b37fd > (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) > [2023-12-13 04:58:52,631] INFO [Consumer > clientId=consumer-grouped-transactions-test-consumer-group-1, > groupId=grouped-transactions-test-consumer-group] Notifying assignor about > the new Assignment(partitions=[input-topic-0, input-topic-1, input-topic-2, > input-topic-3, input-topic-4]) > (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) {code} > However, in generation 5, we seem to be assigned only one partition: > {code:java} > // Gen 5: Now we have only partition 1? But aren't we the last member in the > group? > [2023-12-13 04:58:56,954] DEBUG [Consumer > clientId=consumer-grouped-transactions-test-consumer-group-1, > groupId=grouped-transactions-test-consumer-group] Executing onJoinComplete > with generation 5 and memberId > consumer-grouped-transactions-test-consumer-group-1-2164f472-93f3-4176-af3f-23d4ed8b37fd > (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) > [2023-12-13 04:58:56,955] INFO [Consumer > clientId=consumer-grouped-transactions-test-consumer-group-1, > groupId=grouped-transactions-test-consumer-group] Notifying assignor about > the new Assignment(partitions=[input-topic-1]) > (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator) {code} > The assignment type is range from the JoinGroup for generation 5. 
The decoded > metadata sent by the consumer is this: > {code:java} > Subscription(topics=[input-topic], ownedPartitions=[], groupInstanceId=null, > generationId=4, rackId=null) {code} > Here is the decoded assignment from the SyncGroup: > {code:java} > Assignment(partitions=[input-topic-1]) {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
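For context on why the generation 5 assignment looks wrong: under range assignment, a lone remaining member subscribed to a five-partition topic should receive all five partitions, not just input-topic-1. The sketch below models the range-style arithmetic (contiguous blocks of P/C partitions, with the first P%C consumers taking one extra); it is not Kafka's actual RangeAssignor code, and the member names are illustrative:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Range-style assignment sketch for a single topic with numbered partitions.
public class RangeAssignmentSketch {
    static Map<String, List<Integer>> assign(List<String> members, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        int perMember = numPartitions / members.size();
        int extra = numPartitions % members.size();
        int next = 0;
        for (int i = 0; i < members.size(); i++) {
            // Earlier members absorb the remainder, one extra partition each.
            int count = perMember + (i < extra ? 1 : 0);
            List<Integer> partitions = new ArrayList<>();
            for (int j = 0; j < count; j++) {
                partitions.add(next++);
            }
            assignment.put(members.get(i), partitions);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Generation 5: one surviving member, five partitions of input-topic.
        Map<String, List<Integer>> gen5 = assign(List.of("consumer-1"), 5);
        // A complete range assignment gives the lone member every partition;
        // the bug report shows it received only input-topic-1.
        System.out.println(gen5.get("consumer-1")); // expect [0, 1, 2, 3, 4]
    }
}
```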
[jira] [Updated] (KAFKA-16045) ZkMigrationIntegrationTest.testMigrateTopicDeletion flaky
[ https://issues.apache.org/jira/browse/KAFKA-16045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan updated KAFKA-16045: --- Labels: flaky-test (was: ) > ZkMigrationIntegrationTest.testMigrateTopicDeletion flaky > - > > Key: KAFKA-16045 > URL: https://issues.apache.org/jira/browse/KAFKA-16045 > Project: Kafka > Issue Type: Test >Reporter: Justine Olshan >Priority: Major > Labels: flaky-test > > I'm seeing ZkMigrationIntegrationTest.testMigrateTopicDeletion fail for many > builds. I believe it is also causing a thread leak because on most runs where > it fails I also see ReplicaManager tests also fail with extra threads. > The test always fails > `org.opentest4j.AssertionFailedError: Timed out waiting for topics to be > deleted` > gradle enterprise link: > [https://ge.apache.org/scans/tests?search.names=Git%20branch[…]lues=trunk=kafka.zk.ZkMigrationIntegrationTest|https://ge.apache.org/scans/tests?search.names=Git%20branch=P28D=kafka=America%2FLos_Angeles=trunk=kafka.zk.ZkMigrationIntegrationTest] > recent pr: > [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15023/18/tests/] > trunk builds: > [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2502/tests], > > [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2501/tests] > (edited) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-16045) ZkMigrationIntegrationTest.testMigrateTopicDeletion flaky
Justine Olshan created KAFKA-16045: -- Summary: ZkMigrationIntegrationTest.testMigrateTopicDeletion flaky Key: KAFKA-16045 URL: https://issues.apache.org/jira/browse/KAFKA-16045 Project: Kafka Issue Type: Test Reporter: Justine Olshan I'm seeing ZkMigrationIntegrationTest.testMigrateTopicDeletion fail for many builds. I believe it is also causing a thread leak because on most runs where it fails I also see ReplicaManager tests also fail with extra threads. The test always fails `org.opentest4j.AssertionFailedError: Timed out waiting for topics to be deleted` gradle enterprise link: [https://ge.apache.org/scans/tests?search.names=Git%20branch[…]lues=trunk=kafka.zk.ZkMigrationIntegrationTest|https://ge.apache.org/scans/tests?search.names=Git%20branch=P28D=kafka=America%2FLos_Angeles=trunk=kafka.zk.ZkMigrationIntegrationTest] recent pr: [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-15023/18/tests/] trunk builds: [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2502/tests], [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka/detail/trunk/2501/tests] (edited) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (KAFKA-12670) KRaft support for unclean.leader.election.enable
[ https://issues.apache.org/jira/browse/KAFKA-12670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798270#comment-17798270 ] Justine Olshan commented on KAFKA-12670: This ticket is a little stale. KRaft does support manual unclean leader election, but not automatic election through the config. Here is the reply I also shared on the mailing list: When I asked some of the folks who worked on KRaft about this, they communicated to me that it was intentional to make unclean leader election a manual action. I think that for folks who want to prioritize availability over durability, the aggressive recovery strategy from KIP-966 should be preferable to the old unclean leader election configuration. [https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas#KIP966:EligibleLeaderReplicas-Uncleanrecovery] Let me know if we don't think this is sufficient. > KRaft support for unclean.leader.election.enable > > > Key: KAFKA-12670 > URL: https://issues.apache.org/jira/browse/KAFKA-12670 > Project: Kafka > Issue Type: Task >Reporter: Colin McCabe >Assignee: Ryan Dielhenn >Priority: Major > > Implement KRaft support for the unclean.leader.election.enable > configurations. These configurations can be set at the topic, broker, or > cluster level. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-15784) Ensure atomicity of in memory update and write when transactionally committing offsets
[ https://issues.apache.org/jira/browse/KAFKA-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-15784. Resolution: Fixed > Ensure atomicity of in memory update and write when transactionally > committing offsets > -- > > Key: KAFKA-15784 > URL: https://issues.apache.org/jira/browse/KAFKA-15784 > Project: Kafka > Issue Type: Sub-task >Affects Versions: 3.7.0 >Reporter: Justine Olshan >Assignee: Justine Olshan >Priority: Blocker > > [https://github.com/apache/kafka/pull/14370] (KAFKA-15449) removed the > locking around validating, updating state, and writing to the log > transactional offset commits. (The verification causes us to release the lock) > This was discovered in the discussion of > [https://github.com/apache/kafka/pull/14629] (KAFKA-15653). > Since KAFKA-15653 is needed for 3.5.1, it makes sense to address the locking > issue separately with this ticket. -- This message was sent by Atlassian Jira (v8.20.10#820010)
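The fix described above amounts to holding one lock across validation, the in-memory state update, and the log write, so no concurrent commit can interleave between the two steps. A minimal model of that invariant, assuming a toy coordinator rather than Kafka's real group coordinator code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: the in-memory offset update and the log append happen under
// a single lock, so the log always reflects the in-memory state.
public class AtomicCommitSketch {
    private final Object lock = new Object();
    final Map<String, Long> pendingOffsets = new HashMap<>(); // in-memory state
    final List<String> log = new ArrayList<>();               // modeled log appends

    void commitTxnOffset(String group, long offset) {
        synchronized (lock) {
            // validate + update + append as one atomic unit; releasing the
            // lock between these steps is the bug this ticket addresses
            pendingOffsets.put(group, offset);
            log.add(group + "@" + offset);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicCommitSketch coordinator = new AtomicCommitSketch();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            final int producer = i;
            Thread t = new Thread(() -> {
                for (long offset = 0; offset < 1_000; offset++) {
                    coordinator.commitTxnOffset("group-" + producer, offset);
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
        // With the lock held across both steps, every commit produced
        // exactly one append: 4 threads x 1000 commits.
        System.out.println(coordinator.log.size());
    }
}
```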
[jira] [Updated] (KAFKA-15987) Refactor ReplicaManager code for transaction verification
[ https://issues.apache.org/jira/browse/KAFKA-15987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan updated KAFKA-15987: --- Summary: Refactor ReplicaManager code for transaction verification (was: Refactor ReplicaManager code) > Refactor ReplicaManager code for transaction verification > - > > Key: KAFKA-15987 > URL: https://issues.apache.org/jira/browse/KAFKA-15987 > Project: Kafka > Issue Type: Sub-task >Reporter: Justine Olshan >Assignee: Justine Olshan >Priority: Major > > I started to do this in KAFKA-15784, but the diff was deemed too large and > confusing. I just wanted to file a followup ticket to reference this in code > for the areas that will be refactored. > > I hope to tackle it immediately after. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (KAFKA-15987) Refactor ReplicaManager code
[ https://issues.apache.org/jira/browse/KAFKA-15987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan reassigned KAFKA-15987: -- Assignee: Justine Olshan -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15987) Refactor ReplicaManager code
Justine Olshan created KAFKA-15987: -- Summary: Refactor ReplicaManager code Key: KAFKA-15987 URL: https://issues.apache.org/jira/browse/KAFKA-15987 Project: Kafka Issue Type: Sub-task Reporter: Justine Olshan I started to do this in KAFKA-15784, but the diff was deemed too large and confusing. I just wanted to file a followup ticket to reference this in code for the areas that will be refactored. I hope to tackle it immediately after. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15984) Client disconnections can cause hanging transactions on __consumer_offsets
Justine Olshan created KAFKA-15984: -- Summary: Client disconnections can cause hanging transactions on __consumer_offsets Key: KAFKA-15984 URL: https://issues.apache.org/jira/browse/KAFKA-15984 Project: Kafka Issue Type: Task Reporter: Justine Olshan When investigating frequent hanging transactions on __consumer_offsets partitions, we realized that many of them were caused by the same offset being committed with duplicates, one of which had `"isDisconnectedClient":true`. TxnOffsetCommits do not have sequence numbers and thus are not protected against duplicates in the same way idempotent produce requests are. Thus, when a client is disconnected (and flushes its requests), we may see the duplicate get appended to the log. KIP-890 part 1 should protect against this, as the duplicate will not succeed verification. KIP-890 part 2 strengthens this further, as duplicates (from previous transactions) cannot be added to new transactions if the partition is re-added, since the epoch will be bumped. Another possible solution is to do duplicate checking on the group coordinator side when the request comes in. This solution could be used instead of KIP-890 part 1 to prevent hanging transactions, but given that part 1 only has one open PR remaining, we may not need to do this. However, this can also prevent duplicates from being added to a new transaction – something only part 2 will protect against. -- This message was sent by Atlassian Jira (v8.20.10#820010)
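To illustrate why the missing sequence numbers matter, here is a toy model (not broker code; the names and record formats are made up) contrasting the idempotent produce path, which deduplicates on (producerId, sequence), with a TxnOffsetCommit-style append that carries no sequence to check:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy broker model: produce requests dedupe by (producerId, sequence);
// TxnOffsetCommit requests have no sequence, so a duplicate flushed by a
// disconnecting client is appended to the log a second time.
public class SequenceDedupSketch {
    private final Map<Long, Integer> lastSequence = new HashMap<>(); // producerId -> last seq
    final List<String> log = new ArrayList<>();

    // Idempotent path: reject sequences at or below the last appended one.
    boolean appendProduce(long producerId, int sequence, String record) {
        Integer last = lastSequence.get(producerId);
        if (last != null && sequence <= last) {
            return false; // duplicate of an earlier append: dropped
        }
        lastSequence.put(producerId, sequence);
        log.add(record);
        return true;
    }

    // TxnOffsetCommit path: no sequence number, so no dedup is possible.
    void appendTxnOffsetCommit(String record) {
        log.add(record);
    }

    public static void main(String[] args) {
        SequenceDedupSketch broker = new SequenceDedupSketch();

        broker.appendProduce(42L, 0, "produce:v1");
        broker.appendProduce(42L, 0, "produce:v1"); // retried duplicate: rejected
        broker.appendTxnOffsetCommit("offsets:group@10");
        broker.appendTxnOffsetCommit("offsets:group@10"); // duplicate: appended again

        System.out.println(broker.log);
        // [produce:v1, offsets:group@10, offsets:group@10]
    }
}
```

KIP-890's verification step, as the ticket notes, closes this gap from a different direction: the duplicate fails verification rather than being deduplicated by sequence.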
[jira] [Created] (KAFKA-15975) Update kafka quickstart guide to no longer list ZK start first
Justine Olshan created KAFKA-15975: -- Summary: Update kafka quickstart guide to no longer list ZK start first Key: KAFKA-15975 URL: https://issues.apache.org/jira/browse/KAFKA-15975 Project: Kafka Issue Type: Task Components: docs Affects Versions: 4.0.0 Reporter: Justine Olshan Given we are deprecating ZooKeeper, I think we should update our quickstart guide to not list the ZooKeeper instructions first. With 4.0, we may want to remove it entirely.
[jira] [Commented] (KAFKA-15957) ConsistencyVectorIntegrationTest.shouldHaveSamePositionBoundActiveAndStandBy broken
[ https://issues.apache.org/jira/browse/KAFKA-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792268#comment-17792268 ] Justine Olshan commented on KAFKA-15957: [https://github.com/apache/kafka/pull/14895] is fixing this > ConsistencyVectorIntegrationTest.shouldHaveSamePositionBoundActiveAndStandBy > broken > --- > > Key: KAFKA-15957 > URL: https://issues.apache.org/jira/browse/KAFKA-15957 > Project: Kafka > Issue Type: Bug >Reporter: Justine Olshan >Priority: Major >
[jira] [Created] (KAFKA-15957) ConsistencyVectorIntegrationTest.shouldHaveSamePositionBoundActiveAndStandBy broken
Justine Olshan created KAFKA-15957: -- Summary: ConsistencyVectorIntegrationTest.shouldHaveSamePositionBoundActiveAndStandBy broken Key: KAFKA-15957 URL: https://issues.apache.org/jira/browse/KAFKA-15957 Project: Kafka Issue Type: Bug Reporter: Justine Olshan
[jira] [Comment Edited] (KAFKA-15947) Null pointer on LZ4 compression since Kafka 3.6
[ https://issues.apache.org/jira/browse/KAFKA-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791332#comment-17791332 ] Justine Olshan edited comment on KAFKA-15947 at 11/29/23 9:41 PM: -- Hey – is this the same as https://issues.apache.org/jira/browse/KAFKA-15653? Are you running with transactions? The above ticket will be fixed in 3.6.1 was (Author: jolshan): Hey – is this the same as https://issues.apache.org/jira/browse/KAFKA-15653? Are you running with transactions? > Null pointer on LZ4 compression since Kafka 3.6 > --- > > Key: KAFKA-15947 > URL: https://issues.apache.org/jira/browse/KAFKA-15947 > Project: Kafka > Issue Type: Bug > Components: compression >Affects Versions: 3.6.0 >Reporter: Ludo >Priority: Major > > I have a Kafka Streams application that has been running well for months using client > version {{3.5.1}} with broker 3.5.1 (bitnami image: {{bitnami/3.5.1-debian-11-r44}}) > using {{compression.type: "lz4"}}. > I've recently updated my kafka server to kafka 3.6 (bitnami image: > {{bitnami/kafka:3.6.0-debian-11-r0}}).
> > The startup is working well for days, and after some time, Kafka Stream crash > and Kafka output a lot of NullPointerException on the console: > > {code:java} > org.apache.kafka.common.KafkaException: java.lang.NullPointerException: > Cannot invoke "java.nio.ByteBuffer.hasArray()" because > "this.intermediateBufRef" is null > at > org.apache.kafka.common.record.CompressionType$4.wrapForInput(CompressionType.java:134) > at > org.apache.kafka.common.record.DefaultRecordBatch.recordInputStream(DefaultRecordBatch.java:273) > at > org.apache.kafka.common.record.DefaultRecordBatch.compressedIterator(DefaultRecordBatch.java:277) > at > org.apache.kafka.common.record.DefaultRecordBatch.skipKeyValueIterator(DefaultRecordBatch.java:352) > at > org.apache.kafka.storage.internals.log.LogValidator.validateMessagesAndAssignOffsetsCompressed(LogValidator.java:358) > at > org.apache.kafka.storage.internals.log.LogValidator.validateMessagesAndAssignOffsets(LogValidator.java:165) > at kafka.log.UnifiedLog.$anonfun$append$2(UnifiedLog.scala:805) > at kafka.log.UnifiedLog.append(UnifiedLog.scala:1845) > at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:719) > at > kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1313) > at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1301) > at > kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:1210) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) > at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) > at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) > at scala.collection.mutable.HashMap.foreach(HashMap.scala:149) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > 
at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:1198) > at > kafka.server.ReplicaManager.$anonfun$appendRecords$18$adapted(ReplicaManager.scala:754) > at > kafka.server.KafkaRequestHandler$.$anonfun$wrap$3(KafkaRequestHandler.scala:73) > at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:130) > at java.base/java.lang.Thread.run(Thread.java:833) > Caused by: java.lang.NullPointerException: Cannot invoke > "java.nio.ByteBuffer.hasArray()" because "this.intermediateBufRef" is null > at > org.apache.kafka.common.utils.ChunkedBytesStream.(ChunkedBytesStream.java:89) > at > org.apache.kafka.common.record.CompressionType$4.wrapForInput(CompressionType.java:132) > ... 25 more {code} > At the same time the Kafka Stream raise this error: > > {code:java} > org.apache.kafka.streams.errors.StreamsException: Error encountered sending > record to topic kestra_workertaskresult for task 3_6 due > to:org.apache.kafka.common.errors.UnknownServerException: The server > experienced an unexpected error when processing the request.Written offsets > would not be recorded and no more records would be sent since this is a fatal > error.at > org.apache.kafka.streams.processor.internals.RecordCollectorImpl.recordSendError(RecordCollectorImpl.java:297)at > > org.apache.kafka.streams.processor.internals.RecordCollectorImpl.lambda$send$1(RecordCollectorImpl.java:284)at > >
[jira] [Commented] (KAFKA-15947) Null pointer on LZ4 compression since Kafka 3.6
[ https://issues.apache.org/jira/browse/KAFKA-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791332#comment-17791332 ] Justine Olshan commented on KAFKA-15947: Hey – is this the same as https://issues.apache.org/jira/browse/KAFKA-15653? Are you running with transactions? > Null pointer on LZ4 compression since Kafka 3.6 > --- > > Key: KAFKA-15947 > URL: https://issues.apache.org/jira/browse/KAFKA-15947 > Project: Kafka > Issue Type: Bug > Components: compression >Affects Versions: 3.6.0 >Reporter: Ludo >Priority: Major
[jira] [Commented] (KAFKA-15653) NPE in ChunkedByteStream
[ https://issues.apache.org/jira/browse/KAFKA-15653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17786338#comment-17786338 ] Justine Olshan commented on KAFKA-15653: Yes. [~h...@pinterest.com] this is resolved. KAFKA-15784 which was discovered while working on this is still open, but I'm working on it. > NPE in ChunkedByteStream > > > Key: KAFKA-15653 > URL: https://issues.apache.org/jira/browse/KAFKA-15653 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 3.6.0 > Environment: Docker container on a Linux laptop, using the latest > release. >Reporter: Travis Bischel >Assignee: Justine Olshan >Priority: Major > Fix For: 3.7.0, 3.6.1 > > Attachments: repro.sh > > > When looping franz-go integration tests, I received an UNKNOWN_SERVER_ERROR > from producing. The broker logs for the failing request: > > {noformat} > [2023-10-19 22:29:58,160] ERROR [ReplicaManager broker=2] Error processing > append operation on partition > 2fa8995d8002fbfe68a96d783f26aa2c5efc15368bf44ed8f2ab7e24b41b9879-24 > (kafka.server.ReplicaManager) > java.lang.NullPointerException > at > org.apache.kafka.common.utils.ChunkedBytesStream.(ChunkedBytesStream.java:89) > at > org.apache.kafka.common.record.CompressionType$3.wrapForInput(CompressionType.java:105) > at > org.apache.kafka.common.record.DefaultRecordBatch.recordInputStream(DefaultRecordBatch.java:273) > at > org.apache.kafka.common.record.DefaultRecordBatch.compressedIterator(DefaultRecordBatch.java:277) > at > org.apache.kafka.common.record.DefaultRecordBatch.skipKeyValueIterator(DefaultRecordBatch.java:352) > at > org.apache.kafka.storage.internals.log.LogValidator.validateMessagesAndAssignOffsetsCompressed(LogValidator.java:358) > at > org.apache.kafka.storage.internals.log.LogValidator.validateMessagesAndAssignOffsets(LogValidator.java:165) > at kafka.log.UnifiedLog.append(UnifiedLog.scala:805) > at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:719) > at > 
kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1313) > at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1301) > at > kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:1210) > at > scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28) > at > scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27) > at scala.collection.mutable.HashMap.map(HashMap.scala:35) > at > kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:1198) > at kafka.server.ReplicaManager.appendEntries$1(ReplicaManager.scala:754) > at > kafka.server.ReplicaManager.$anonfun$appendRecords$18(ReplicaManager.scala:874) > at > kafka.server.ReplicaManager.$anonfun$appendRecords$18$adapted(ReplicaManager.scala:874) > at > kafka.server.KafkaRequestHandler$.$anonfun$wrap$3(KafkaRequestHandler.scala:73) > at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:130) > at java.base/java.lang.Thread.run(Unknown Source) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010)
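The stack traces in KAFKA-15653 and KAFKA-15947 both end in `ChunkedBytesStream.<init>` dereferencing a null buffer, i.e. a buffer obtained from a shared cache coming back null on the decompression path. Without asserting that this was the actual root cause here, one general way such symptoms arise is a non-thread-safe buffer cache being used from more than one request-handler thread, and one common defensive pattern is to confine the cache per thread. A minimal, hypothetical sketch of that pattern (illustrative names, not Kafka's classes or fix):

```java
import java.nio.ByteBuffer;

// Hedged sketch: thread-confining a decompression buffer with ThreadLocal so
// each handler thread gets its own, never-null instance. Illustrative only;
// this is not how Kafka's BufferSupplier is implemented.
public class ThreadConfinedBuffers {
    private static final ThreadLocal<ByteBuffer> BUF =
        ThreadLocal.withInitial(() -> ByteBuffer.allocate(16 * 1024));

    // Each calling thread sees its own buffer; no cross-thread sharing.
    public static ByteBuffer get() {
        return BUF.get();
    }

    public static void main(String[] args) throws InterruptedException {
        ByteBuffer mine = get();
        ByteBuffer[] theirs = new ByteBuffer[1];
        Thread other = new Thread(() -> theirs[0] = get());
        other.start();
        other.join();
        System.out.println(mine != null && theirs[0] != null); // both non-null
        System.out.println(mine != theirs[0]);                 // distinct instances
    }
}
```

The trade-off is memory: one buffer per thread instead of a shared pool, which is why pooled designs with explicit hand-off are often preferred when allocation is expensive.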
[jira] [Resolved] (KAFKA-15653) NPE in ChunkedByteStream
[ https://issues.apache.org/jira/browse/KAFKA-15653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Justine Olshan resolved KAFKA-15653. Fix Version/s: 3.7.0 3.6.1 Resolution: Fixed > NPE in ChunkedByteStream > > > Key: KAFKA-15653 > URL: https://issues.apache.org/jira/browse/KAFKA-15653 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 3.6.0 > Environment: Docker container on a Linux laptop, using the latest > release. >Reporter: Travis Bischel >Assignee: Justine Olshan >Priority: Major > Fix For: 3.7.0, 3.6.1 > > Attachments: repro.sh
[jira] [Commented] (KAFKA-14806) Add connection timeout in PlaintextSender used by SelectorTests
[ https://issues.apache.org/jira/browse/KAFKA-14806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784619#comment-17784619 ] Justine Olshan commented on KAFKA-14806: Had the same sort of failures on my run recently: [https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-14712/6/] Funny how the SocketServerTests also all fail too. > Add connection timeout in PlaintextSender used by SelectorTests > --- > > Key: KAFKA-14806 > URL: https://issues.apache.org/jira/browse/KAFKA-14806 > Project: Kafka > Issue Type: Test >Reporter: Alexandre Dupriez >Assignee: Alexandre Dupriez >Priority: Minor > > Tests in {{SelectorTest}} can fail due to spurious connection timeouts. One > example can be found in [this > build|https://github.com/apache/kafka/pull/13378/checks?check_run_id=11970595528] > where the client connection that the {{PlaintextSender}} tried to open could not > be established before the test timed out. > It may be worth enforcing a connection timeout and retries if this adds to > the selector tests' resiliency. Note that {{PlaintextSender}} is only used by > the {{SelectorTest}} so the scope of the change would remain local.
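The change the ticket proposes amounts to replacing a blocking connect with `java.net.Socket.connect(SocketAddress, int)`, which fails fast with an exception instead of hanging until the test-level timeout, plus a small retry loop. A sketch of that idea under assumed parameter choices (the helper name and retry policy are illustrative, not the ticket's actual patch):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hedged sketch: connect with an explicit timeout and bounded retries, so a
// flaky connection surfaces as a quick, retryable failure rather than a hang.
public class TimedConnect {
    public static Socket connectWithTimeout(InetSocketAddress addr,
                                            int timeoutMs,
                                            int retries) throws IOException {
        IOException last = null;
        for (int attempt = 0; attempt <= retries; attempt++) {
            Socket socket = new Socket(); // unconnected until connect() is called
            try {
                socket.connect(addr, timeoutMs); // throws instead of blocking forever
                return socket;
            } catch (IOException e) {
                last = e;       // remember the failure and retry
                socket.close(); // don't leak the half-open socket
            }
        }
        throw last; // all attempts exhausted
    }

    public static void main(String[] args) throws IOException {
        // Demo against a local listener so the connect succeeds immediately.
        try (ServerSocket server = new ServerSocket(0)) {
            InetSocketAddress addr =
                new InetSocketAddress("127.0.0.1", server.getLocalPort());
            try (Socket s = connectWithTimeout(addr, 1_000, 2)) {
                System.out.println(s.isConnected()); // true
            }
        }
    }
}
```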
[jira] [Created] (KAFKA-15798) Flaky Test NamedTopologyIntegrationTest.shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology()
Justine Olshan created KAFKA-15798: -- Summary: Flaky Test NamedTopologyIntegrationTest.shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology() Key: KAFKA-15798 URL: https://issues.apache.org/jira/browse/KAFKA-15798 Project: Kafka Issue Type: Bug Reporter: Justine Olshan I saw a few examples recently. 2 have the same error, but the third is different [https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-14629/22/testReport/junit/org.apache.kafka.streams.integration/NamedTopologyIntegrationTest/Build___JDK_8_and_Scala_2_12___shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology___2/] [https://ci-builds.apache.org/job/Kafka/job/kafka/job/trunk/2365/testReport/junit/org.apache.kafka.streams.integration/NamedTopologyIntegrationTest/Build___JDK_21_and_Scala_2_13___shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology__/] The failure is like {code:java} java.lang.AssertionError: Did not receive all 5 records from topic output-stream-1 within 6 ms, currently accumulated data is [] Expected: is a value equal to or greater than <5> but: <0> was less than <5>{code} The other failure was [https://ci-builds.apache.org/job/Kafka/job/kafka/job/trunk/2365/testReport/junit/org.apache.kafka.streams.integration/NamedTopologyIntegrationTest/Build___JDK_8_and_Scala_2_12___shouldAddAndRemoveNamedTopologiesBeforeStartingAndRouteQueriesToCorrectTopology__/] {code:java} java.lang.AssertionError: Expected: <[0, 1]> but: was <[0]>{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
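The failing assertion above is the classic poll-until-timeout pattern: the test repeatedly reads the output topic and fails if the expected record count is not reached before a deadline, which is exactly where slow CI machines turn into flaky failures. A minimal sketch of that pattern (illustrative names, standing in for helpers like Kafka's `TestUtils.waitForCondition`):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Illustrative sketch of the poll-until-timeout check behind failures like
// "Did not receive all 5 records from topic ... within N ms".
public class WaitUntil {
    public static boolean waitForCondition(BooleanSupplier condition,
                                           long timeoutMs,
                                           long pollIntervalMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                return false; // deadline hit first: the flaky-test failure mode
            }
            Thread.sleep(pollIntervalMs);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        List<Integer> received = new ArrayList<>();
        // Simulate records trickling in from another thread, as in the test.
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 5; i++) {
                synchronized (received) { received.add(i); }
                try { Thread.sleep(10); } catch (InterruptedException ignored) { }
            }
        });
        producer.start();
        boolean ok = waitForCondition(
            () -> { synchronized (received) { return received.size() >= 5; } },
            5_000, 20);
        producer.join();
        System.out.println(ok); // true: all 5 records arrived within the timeout
    }
}
```

With a generous timeout the check passes; the flakiness reported here is what happens when the accumulation is still empty (`currently accumulated data is []`) at the deadline.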