[jira] [Resolved] (KAFKA-15510) Follower's lastFetchedEpoch wrongly set when fetch response has no record
[ https://issues.apache.org/jira/browse/KAFKA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajini Sivaram resolved KAFKA-15510.
------------------------------------
Fix Version/s: 3.7.0
Resolution: Fixed

> Follower's lastFetchedEpoch wrongly set when fetch response has no record
> -------------------------------------------------------------------------
>
> Key: KAFKA-15510
> URL: https://issues.apache.org/jira/browse/KAFKA-15510
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 3.6.0, 3.5.1
> Reporter: Chern Yih Cheah
> Assignee: Chern Yih Cheah
> Priority: Major
> Fix For: 3.7.0
>
> A regression was introduced by
> https://github.com/apache/kafka/pull/13843/files#diff-508e9dc4d52744119dda36d69ce63a1901abfd3080ca72fc4554250b7e9f5242
> When the fetch response has no records for a partition, validBytes is 0. In
> this case, we shouldn't set the last fetched epoch to
> logAppendInfo.lastLeaderEpoch.asScala, since there are no records and it is
> Optional.empty. We should use currentFetchState.lastFetchedEpoch instead.
> One effect of this is that truncation on fetch might not work correctly.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
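The described fix reduces to a one-line guard. A minimal Java sketch, with hypothetical names that mirror the Jira text (validBytes, lastLeaderEpoch, lastFetchedEpoch); these are illustrative stand-ins, not the actual Kafka follower-fetcher classes:

```java
import java.util.Optional;

// Sketch of the fix described in KAFKA-15510. All names are hypothetical
// stand-ins mirroring the Jira description, not Kafka's real classes.
public class LastFetchedEpochSketch {

    static Optional<Integer> nextLastFetchedEpoch(int validBytes,
                                                  Optional<Integer> appendInfoLastLeaderEpoch,
                                                  Optional<Integer> currentLastFetchedEpoch) {
        // Buggy behavior: always take the epoch from the append info, which is
        // Optional.empty() when the fetch response carried no records.
        // Fixed behavior: only use the append info's epoch when records were
        // actually appended; otherwise retain the fetch state's previous epoch.
        return validBytes > 0 ? appendInfoLastLeaderEpoch : currentLastFetchedEpoch;
    }

    public static void main(String[] args) {
        // Empty fetch response: the previously fetched epoch is retained.
        System.out.println(nextLastFetchedEpoch(0, Optional.empty(), Optional.of(5)));
        // Non-empty response: the epoch of the last appended record wins.
        System.out.println(nextLastFetchedEpoch(100, Optional.of(6), Optional.of(5)));
    }
}
```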
[jira] [Commented] (KAFKA-15510) Follower's lastFetchedEpoch wrongly set when fetch response has no record
[ https://issues.apache.org/jira/browse/KAFKA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770051#comment-17770051 ]

Rajini Sivaram commented on KAFKA-15510:
----------------------------------------

This does not currently impact truncation in followers, but it seems useful to fix the regression to avoid breaking in the future if the timing scenario for truncation changes.

> Follower's lastFetchedEpoch wrongly set when fetch response has no record
> -------------------------------------------------------------------------
>
> Key: KAFKA-15510
> URL: https://issues.apache.org/jira/browse/KAFKA-15510
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 3.6.0, 3.5.1
> Reporter: Chern Yih Cheah
> Assignee: Chern Yih Cheah
> Priority: Major
>
> A regression was introduced by
> https://github.com/apache/kafka/pull/13843/files#diff-508e9dc4d52744119dda36d69ce63a1901abfd3080ca72fc4554250b7e9f5242
> When the fetch response has no records for a partition, validBytes is 0. In
> this case, we shouldn't set the last fetched epoch to
> logAppendInfo.lastLeaderEpoch.asScala, since there are no records and it is
> Optional.empty. We should use currentFetchState.lastFetchedEpoch instead.
> One effect of this is that truncation on fetch might not work correctly.
[jira] [Resolved] (KAFKA-15117) SslTransportLayerTest.testValidEndpointIdentificationCN fails with Java 20 & 21
[ https://issues.apache.org/jira/browse/KAFKA-15117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajini Sivaram resolved KAFKA-15117.
------------------------------------
Reviewer: Rajini Sivaram
Resolution: Fixed

> SslTransportLayerTest.testValidEndpointIdentificationCN fails with Java 20 & 21
> -------------------------------------------------------------------------------
>
> Key: KAFKA-15117
> URL: https://issues.apache.org/jira/browse/KAFKA-15117
> Project: Kafka
> Issue Type: Bug
> Reporter: Ismael Juma
> Assignee: Purshotam Chauhan
> Priority: Major
> Fix For: 3.7.0
>
> All variations fail as seen below. These tests have been disabled when run
> with Java 20 & 21 for now.
> {code:java}
> Gradle Test Run :clients:test > Gradle Test Executor 12 > SslTransportLayerTest > testValidEndpointIdentificationCN(Args) > [1] tlsProtocol=TLSv1.2, useInlinePem=false FAILED
>     org.opentest4j.AssertionFailedError: Channel 0 was not ready after 30 seconds ==> expected: <true> but was: <false>
>         at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>         at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>         at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
>         at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
>         at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:211)
>         at app//org.apache.kafka.common.network.NetworkTestUtils.waitForChannelReady(NetworkTestUtils.java:107)
>         at app//org.apache.kafka.common.network.NetworkTestUtils.checkClientConnection(NetworkTestUtils.java:70)
>         at app//org.apache.kafka.common.network.SslTransportLayerTest.verifySslConfigs(SslTransportLayerTest.java:1296)
>         at app//org.apache.kafka.common.network.SslTransportLayerTest.testValidEndpointIdentificationCN(SslTransportLayerTest.java:202)
>
> org.apache.kafka.common.network.SslTransportLayerTest.testValidEndpointIdentificationCN(Args)[2] failed, log available in /home/ijuma/src/kafka/clients/build/reports/testOutput/org.apache.kafka.common.network.SslTransportLayerTest.testValidEndpointIdentificationCN(Args)[2].test.stdout
>
> Gradle Test Run :clients:test > Gradle Test Executor 12 > SslTransportLayerTest > testValidEndpointIdentificationCN(Args) > [2] tlsProtocol=TLSv1.2, useInlinePem=true FAILED
>     org.opentest4j.AssertionFailedError: Channel 0 was not ready after 30 seconds ==> expected: <true> but was: <false>
>         at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>         at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>         at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
>         at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
>         at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:211)
>         at app//org.apache.kafka.common.network.NetworkTestUtils.waitForChannelReady(NetworkTestUtils.java:107)
>         at app//org.apache.kafka.common.network.NetworkTestUtils.checkClientConnection(NetworkTestUtils.java:70)
>         at app//org.apache.kafka.common.network.SslTransportLayerTest.verifySslConfigs(SslTransportLayerTest.java:1296)
>         at app//org.apache.kafka.common.network.SslTransportLayerTest.testValidEndpointIdentificationCN(SslTransportLayerTest.java:202)
>
> org.apache.kafka.common.network.SslTransportLayerTest.testValidEndpointIdentificationCN(Args)[3] failed, log available in /home/ijuma/src/kafka/clients/build/reports/testOutput/org.apache.kafka.common.network.SslTransportLayerTest.testValidEndpointIdentificationCN(Args)[3].test.stdout
>
> Gradle Test Run :clients:test > Gradle Test Executor 12 > SslTransportLayerTest > testValidEndpointIdentificationCN(Args) > [3] tlsProtocol=TLSv1.3, useInlinePem=false FAILED
>     org.opentest4j.AssertionFailedError: Channel 0 was not ready after 30 seconds ==> expected: <true> but was: <false>
>         at app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>         at app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>         at app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
>         at app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
>         at app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:211)
>         at app//org.apache.kafka.common.network.NetworkTestUtils.waitForChannelReady(NetworkTestUtils.java:107)
>         at app//org.apache.kafka.common.network.NetworkTestUtils.checkClientConnection(NetworkTestUtils.java:70)
[jira] [Resolved] (KAFKA-14891) Fix rack-aware range assignor to improve rack-awareness with co-partitioning
[ https://issues.apache.org/jira/browse/KAFKA-14891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajini Sivaram resolved KAFKA-14891.
------------------------------------
Fix Version/s: 3.5.0
Reviewer: David Jacot
Resolution: Fixed

> Fix rack-aware range assignor to improve rack-awareness with co-partitioning
> ----------------------------------------------------------------------------
>
> Key: KAFKA-14891
> URL: https://issues.apache.org/jira/browse/KAFKA-14891
> Project: Kafka
> Issue Type: Bug
> Reporter: Rajini Sivaram
> Assignee: Rajini Sivaram
> Priority: Major
> Fix For: 3.5.0
>
> We currently check all states for rack-aware assignment with co-partitioning
> (https://github.com/apache/kafka/blob/396536bb5aa1ba78c71ea824d736640b615bda8a/clients/src/main/java/org/apache/kafka/clients/consumer/RangeAssignor.java#L176).
> We should check each group of co-partitioned states separately so that we
> can use rack-aware assignment with co-partitioning for subsets of topics.
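The proposed change can be sketched as grouping topics before the rack-awareness check. A hypothetical Java sketch, assuming topics count as co-partitioned when they share a subscriber set and a partition count; the helper names are illustrative, not the actual RangeAssignor internals:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical sketch of the per-group check: rather than requiring
// rack-aware co-partitioned assignment to be feasible across ALL topics,
// group topics by (subscriber set, partition count) and evaluate each
// co-partitioned group independently.
public class CoPartitionSketch {

    static Map<List<Object>, List<String>> coPartitionGroups(
            Map<String, Set<String>> subscribersByTopic,
            Map<String, Integer> partitionCountByTopic) {
        return subscribersByTopic.keySet().stream()
                .sorted()
                .collect(Collectors.groupingBy(topic -> List.of(
                        new TreeSet<>(subscribersByTopic.get(topic)),  // normalized subscriber set
                        partitionCountByTopic.get(topic))));           // co-partitioning requires equal counts
    }

    public static void main(String[] args) {
        // t1 and t2 are co-partitioned (same subscribers, same partition count);
        // t3 forms its own group, so it could still be assigned rack-aware
        // even if the t1/t2 group cannot.
        System.out.println(coPartitionGroups(
                Map.of("t1", Set.of("c1", "c2"), "t2", Set.of("c1", "c2"), "t3", Set.of("c1")),
                Map.of("t1", 4, "t2", 4, "t3", 4)));
    }
}
```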
[jira] [Created] (KAFKA-14891) Fix rack-aware range assignor to improve rack-awareness with co-partitioning
Rajini Sivaram created KAFKA-14891:
--------------------------------------

Summary: Fix rack-aware range assignor to improve rack-awareness with co-partitioning
Key: KAFKA-14891
URL: https://issues.apache.org/jira/browse/KAFKA-14891
Project: Kafka
Issue Type: Bug
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram

We currently check all states for rack-aware assignment with co-partitioning
(https://github.com/apache/kafka/blob/396536bb5aa1ba78c71ea824d736640b615bda8a/clients/src/main/java/org/apache/kafka/clients/consumer/RangeAssignor.java#L176).
We should check each group of co-partitioned states separately so that we can
use rack-aware assignment with co-partitioning for subsets of topics.
[jira] [Resolved] (KAFKA-14450) Rack-aware partition assignment for consumers (KIP-881)
[ https://issues.apache.org/jira/browse/KAFKA-14450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajini Sivaram resolved KAFKA-14450.
------------------------------------
Resolution: Fixed

> Rack-aware partition assignment for consumers (KIP-881)
> -------------------------------------------------------
>
> Key: KAFKA-14450
> URL: https://issues.apache.org/jira/browse/KAFKA-14450
> Project: Kafka
> Issue Type: New Feature
> Components: consumer
> Reporter: Rajini Sivaram
> Assignee: Rajini Sivaram
> Priority: Major
>
> Top-level ticket for KIP-881, since we are splitting the PR into 3.
[jira] [Resolved] (KAFKA-14452) Make sticky assignors rack-aware if consumer racks are configured.
[ https://issues.apache.org/jira/browse/KAFKA-14452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajini Sivaram resolved KAFKA-14452.
------------------------------------
Fix Version/s: 3.5.0
Reviewer: David Jacot
Resolution: Fixed

> Make sticky assignors rack-aware if consumer racks are configured.
> ------------------------------------------------------------------
>
> Key: KAFKA-14452
> URL: https://issues.apache.org/jira/browse/KAFKA-14452
> Project: Kafka
> Issue Type: Sub-task
> Reporter: Rajini Sivaram
> Assignee: Rajini Sivaram
> Priority: Major
> Fix For: 3.5.0
>
> See KIP-881 for details
[jira] [Resolved] (KAFKA-14867) Trigger rebalance when replica racks change if client.rack is configured
[ https://issues.apache.org/jira/browse/KAFKA-14867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajini Sivaram resolved KAFKA-14867.
------------------------------------
Reviewer: David Jacot
Resolution: Fixed

> Trigger rebalance when replica racks change if client.rack is configured
> -------------------------------------------------------------------------
>
> Key: KAFKA-14867
> URL: https://issues.apache.org/jira/browse/KAFKA-14867
> Project: Kafka
> Issue Type: Sub-task
> Components: consumer
> Reporter: Rajini Sivaram
> Assignee: Rajini Sivaram
> Priority: Major
> Fix For: 3.5.0
>
> To improve locality after reassignments, trigger a rebalance from the leader
> if the set of racks of partition replicas changes.
[jira] [Created] (KAFKA-14867) Trigger rebalance when replica racks change if client.rack is configured
Rajini Sivaram created KAFKA-14867:
--------------------------------------

Summary: Trigger rebalance when replica racks change if client.rack is configured
Key: KAFKA-14867
URL: https://issues.apache.org/jira/browse/KAFKA-14867
Project: Kafka
Issue Type: Sub-task
Components: consumer
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram
Fix For: 3.5.0

To improve locality after reassignments, trigger a rebalance from the leader if the set of racks of partition replicas changes.
[jira] [Resolved] (KAFKA-14770) Allow dynamic keystore update for brokers if string representation of DN matches even if canonical DNs don't match
[ https://issues.apache.org/jira/browse/KAFKA-14770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajini Sivaram resolved KAFKA-14770.
------------------------------------
Reviewer: Manikumar
Resolution: Fixed

> Allow dynamic keystore update for brokers if string representation of DN
> matches even if canonical DNs don't match
> -------------------------------------------------------------------------
>
> Key: KAFKA-14770
> URL: https://issues.apache.org/jira/browse/KAFKA-14770
> Project: Kafka
> Issue Type: Improvement
> Components: security
> Reporter: Rajini Sivaram
> Assignee: Rajini Sivaram
> Priority: Major
> Fix For: 3.5.0
>
> To avoid mistakes during dynamic broker config updates that could potentially
> affect clients, we restrict the changes that can be performed dynamically
> without a broker restart. For broker keystore updates, we require the DN to
> be the same for the old and new certificates, since it could contain host
> names used for host name verification by clients. DNs are compared using the
> standard Java implementation of X500Principal.equals(), which compares
> canonical names. If the tag of a field changes from one with a printable
> string representation to one without, or vice versa, the canonical name check
> fails even if the actual name is the same, since the canonical representation
> converts to hex for some tags only. We can relax the verification to allow
> dynamic updates in this case by permitting the update if either the canonical
> name or the RFC 2253 string representation of the DN matches.
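The relaxed check described above can be expressed directly with the standard javax.security.auth.x500.X500Principal API; the method name and DN values below are made up for illustration:

```java
import javax.security.auth.x500.X500Principal;

// Illustrative sketch of the relaxed DN check: accept the new keystore if
// EITHER the canonical form (what X500Principal.equals() compares) OR the
// RFC 2253 string form of the two DNs matches. Names and DNs are
// hypothetical, not Kafka's actual validation code.
public class DnMatchSketch {

    static boolean dnsMatchForDynamicUpdate(X500Principal oldDn, X500Principal newDn) {
        return oldDn.equals(newDn)                        // canonical-name comparison
                || oldDn.getName(X500Principal.RFC2253)   // relaxed: RFC 2253 string form
                        .equals(newDn.getName(X500Principal.RFC2253));
    }

    public static void main(String[] args) {
        X500Principal a = new X500Principal("CN=broker1.example.com, O=Example");
        X500Principal b = new X500Principal("CN=broker1.example.com, O=Example");
        System.out.println(dnsMatchForDynamicUpdate(a, b));
    }
}
```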
[jira] [Resolved] (KAFKA-14451) Make range assignor rack-aware if consumer racks are configured
[ https://issues.apache.org/jira/browse/KAFKA-14451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajini Sivaram resolved KAFKA-14451.
------------------------------------
Fix Version/s: 3.5.0
Reviewer: David Jacot
Resolution: Fixed

> Make range assignor rack-aware if consumer racks are configured
> ---------------------------------------------------------------
>
> Key: KAFKA-14451
> URL: https://issues.apache.org/jira/browse/KAFKA-14451
> Project: Kafka
> Issue Type: Sub-task
> Reporter: Rajini Sivaram
> Assignee: Rajini Sivaram
> Priority: Major
> Fix For: 3.5.0
>
> See KIP-881 for details
[jira] [Created] (KAFKA-14770) Allow dynamic keystore update for brokers if string representation of DN matches even if canonical DNs don't match
Rajini Sivaram created KAFKA-14770:
--------------------------------------

Summary: Allow dynamic keystore update for brokers if string representation of DN matches even if canonical DNs don't match
Key: KAFKA-14770
URL: https://issues.apache.org/jira/browse/KAFKA-14770
Project: Kafka
Issue Type: Improvement
Components: security
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram
Fix For: 3.5.0

To avoid mistakes during dynamic broker config updates that could potentially affect clients, we restrict the changes that can be performed dynamically without a broker restart. For broker keystore updates, we require the DN to be the same for the old and new certificates, since it could contain host names used for host name verification by clients. DNs are compared using the standard Java implementation of X500Principal.equals(), which compares canonical names. If the tag of a field changes from one with a printable string representation to one without, or vice versa, the canonical name check fails even if the actual name is the same, since the canonical representation converts to hex for some tags only. We can relax the verification to allow dynamic updates in this case by permitting the update if either the canonical name or the RFC 2253 string representation of the DN matches.
[jira] [Resolved] (KAFKA-14676) Token endpoint URL used for OIDC cannot be set on the JAAS config
[ https://issues.apache.org/jira/browse/KAFKA-14676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajini Sivaram resolved KAFKA-14676.
------------------------------------
Fix Version/s: 3.5.0, 3.4.1, 3.3.3
Reviewer: Manikumar
Resolution: Fixed

> Token endpoint URL used for OIDC cannot be set on the JAAS config
> -----------------------------------------------------------------
>
> Key: KAFKA-14676
> URL: https://issues.apache.org/jira/browse/KAFKA-14676
> Project: Kafka
> Issue Type: Bug
> Components: security
> Affects Versions: 3.1.2, 3.4.0, 3.2.3, 3.3.2
> Reporter: Rajini Sivaram
> Assignee: Rajini Sivaram
> Priority: Major
> Fix For: 3.5.0, 3.4.1, 3.3.3
>
> Kafka allows multiple clients within a JVM to use different SASL
> configurations by configuring the JAAS configuration in `sasl.jaas.config`
> instead of the JVM-wide system property. For SASL login, we reuse logins
> within a JVM by caching logins indexed by their sasl.jaas.config. This relies
> on login configs being overridable using `sasl.jaas.config`.
> KIP-768
> (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=186877575)
> added support for OIDC for SASL/OAUTHBEARER. The token endpoint used to
> acquire tokens can currently only be configured using the Kafka config
> `sasl.oauthbearer.token.endpoint.url`. This prevents different clients within
> a JVM from using different URLs. We need to either provide a way to override
> the URL within `sasl.jaas.config` or include more of the client configs in
> the LoginMetadata used as key for cached logins.
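The caching behavior described above can be sketched as a map keyed by the `sasl.jaas.config` value. A hypothetical sketch (not Kafka's actual LoginManager API) showing why a config that is not part of the cache key, such as the token endpoint URL, cannot differ between clients that share a JAAS config:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of per-JVM login caching keyed by sasl.jaas.config.
// Class and method names are illustrative stand-ins for the behavior the
// ticket describes, not Kafka's real login manager.
public class LoginCacheSketch {

    record Login(String jaasConfig) {}

    private final Map<String, Login> cache = new ConcurrentHashMap<>();

    Login acquire(String saslJaasConfig) {
        // Two clients with the same sasl.jaas.config get the SAME cached Login,
        // even if other configs (e.g. a token endpoint URL) differ between them.
        return cache.computeIfAbsent(saslJaasConfig, Login::new);
    }

    public static void main(String[] args) {
        LoginCacheSketch mgr = new LoginCacheSketch();
        Login a = mgr.acquire("OAuthBearerLoginModule required;");
        Login b = mgr.acquire("OAuthBearerLoginModule required;");
        System.out.println(a == b);   // same cached login instance
    }
}
```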
[jira] [Created] (KAFKA-14676) Token endpoint URL used for OIDC cannot be set on the JAAS config
Rajini Sivaram created KAFKA-14676:
--------------------------------------

Summary: Token endpoint URL used for OIDC cannot be set on the JAAS config
Key: KAFKA-14676
URL: https://issues.apache.org/jira/browse/KAFKA-14676
Project: Kafka
Issue Type: Bug
Components: security
Affects Versions: 3.3.2, 3.2.3, 3.1.2, 3.4.0
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram

Kafka allows multiple clients within a JVM to use different SASL configurations by configuring the JAAS configuration in `sasl.jaas.config` instead of the JVM-wide system property. For SASL login, we reuse logins within a JVM by caching logins indexed by their sasl.jaas.config. This relies on login configs being overridable using `sasl.jaas.config`.

KIP-768 (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=186877575) added support for OIDC for SASL/OAUTHBEARER. The token endpoint used to acquire tokens can currently only be configured using the Kafka config `sasl.oauthbearer.token.endpoint.url`. This prevents different clients within a JVM from using different URLs. We need to either provide a way to override the URL within `sasl.jaas.config` or include more of the client configs in the LoginMetadata used as key for cached logins.
[jira] [Resolved] (KAFKA-14352) Support rack-aware partition assignment for Kafka consumers
[ https://issues.apache.org/jira/browse/KAFKA-14352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajini Sivaram resolved KAFKA-14352.
------------------------------------
Reviewer: David Jacot
Resolution: Fixed

This includes protocol changes for rack-aware assignment. Default assignors will be made rack-aware under follow-on tickets in the next release.

> Support rack-aware partition assignment for Kafka consumers
> -----------------------------------------------------------
>
> Key: KAFKA-14352
> URL: https://issues.apache.org/jira/browse/KAFKA-14352
> Project: Kafka
> Issue Type: Sub-task
> Components: consumer
> Reporter: Rajini Sivaram
> Assignee: Rajini Sivaram
> Priority: Major
> Fix For: 3.4.0
>
> KIP-392 added support for consumers to fetch from the replica in their local
> rack. To benefit from locality, consumers need to be assigned partitions
> which have a replica in the same rack. This works well when the replication
> factor is the same as the number of racks, since every rack would then have a
> replica with rack-aware replica assignment. If the number of racks is higher,
> some racks may not have replicas of some partitions, and hence consumers in
> those racks will have to fetch from another rack. It will be useful to
> propagate the rack in the subscription metadata for consumers and provide a
> rack-aware partition assignor for consumers.
[jira] [Created] (KAFKA-14452) Make sticky assignors rack-aware if consumer racks are configured.
Rajini Sivaram created KAFKA-14452:
--------------------------------------

Summary: Make sticky assignors rack-aware if consumer racks are configured.
Key: KAFKA-14452
URL: https://issues.apache.org/jira/browse/KAFKA-14452
Project: Kafka
Issue Type: Sub-task
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram

See KIP-881 for details
[jira] [Created] (KAFKA-14451) Make range assignor rack-aware if consumer racks are configured
Rajini Sivaram created KAFKA-14451:
--------------------------------------

Summary: Make range assignor rack-aware if consumer racks are configured
Key: KAFKA-14451
URL: https://issues.apache.org/jira/browse/KAFKA-14451
Project: Kafka
Issue Type: Sub-task
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram

See KIP-881 for details
[jira] [Updated] (KAFKA-14352) Support rack-aware partition assignment for Kafka consumers
[ https://issues.apache.org/jira/browse/KAFKA-14352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajini Sivaram updated KAFKA-14352:
-----------------------------------
Parent: KAFKA-14450
Issue Type: Sub-task  (was: New Feature)

> Support rack-aware partition assignment for Kafka consumers
> -----------------------------------------------------------
>
> Key: KAFKA-14352
> URL: https://issues.apache.org/jira/browse/KAFKA-14352
> Project: Kafka
> Issue Type: Sub-task
> Components: consumer
> Reporter: Rajini Sivaram
> Assignee: Rajini Sivaram
> Priority: Major
> Fix For: 3.4.0
>
> KIP-392 added support for consumers to fetch from the replica in their local
> rack. To benefit from locality, consumers need to be assigned partitions
> which have a replica in the same rack. This works well when the replication
> factor is the same as the number of racks, since every rack would then have a
> replica with rack-aware replica assignment. If the number of racks is higher,
> some racks may not have replicas of some partitions, and hence consumers in
> those racks will have to fetch from another rack. It will be useful to
> propagate the rack in the subscription metadata for consumers and provide a
> rack-aware partition assignor for consumers.
[jira] [Created] (KAFKA-14450) Rack-aware partition assignment for consumers (KIP-881)
Rajini Sivaram created KAFKA-14450:
--------------------------------------

Summary: Rack-aware partition assignment for consumers (KIP-881)
Key: KAFKA-14450
URL: https://issues.apache.org/jira/browse/KAFKA-14450
Project: Kafka
Issue Type: New Feature
Components: consumer
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram

Top-level ticket for KIP-881, since we are splitting the PR into 3.
[jira] [Created] (KAFKA-14352) Support rack-aware partition assignment for Kafka consumers
Rajini Sivaram created KAFKA-14352:
--------------------------------------

Summary: Support rack-aware partition assignment for Kafka consumers
Key: KAFKA-14352
URL: https://issues.apache.org/jira/browse/KAFKA-14352
Project: Kafka
Issue Type: New Feature
Components: consumer
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram
Fix For: 3.4.0

KIP-392 added support for consumers to fetch from the replica in their local rack. To benefit from locality, consumers need to be assigned partitions which have a replica in the same rack. This works well when the replication factor is the same as the number of racks, since every rack would then have a replica with rack-aware replica assignment. If the number of racks is higher, some racks may not have replicas of some partitions, and hence consumers in those racks will have to fetch from another rack. It will be useful to propagate the rack in the subscription metadata for consumers and provide a rack-aware partition assignor for consumers.
[jira] [Resolved] (KAFKA-13559) The broker's ProduceResponse may be delayed for 300ms
[ https://issues.apache.org/jira/browse/KAFKA-13559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajini Sivaram resolved KAFKA-13559.
------------------------------------
Fix Version/s: 3.4.0
Reviewer: Rajini Sivaram
Resolution: Fixed

> The broker's ProduceResponse may be delayed for 300ms
> -----------------------------------------------------
>
> Key: KAFKA-13559
> URL: https://issues.apache.org/jira/browse/KAFKA-13559
> Project: Kafka
> Issue Type: Task
> Components: core
> Affects Versions: 2.7.0
> Reporter: frankshi
> Assignee: Badai Aqrandista
> Priority: Major
> Fix For: 3.4.0
>
> Attachments: image-1.png, image-2.png, image-2021-12-21-14-44-56-689.png,
> image-2021-12-21-14-45-22-716.png, image-3.png, image-5.png, image-6.png,
> image-7.png, image.png
>
> Hi team:
> We have found that the value in the source code here
> (https://github.com/apache/kafka/blob/2.7/core/src/main/scala/kafka/network/SocketServer.scala#L922)
> may cause the broker's ProduceResponse to be delayed for 300ms.
> * Server version: 2.13-2.7.0.
> * Client version: confluent-kafka-python-1.5.0.
> We have set the client's configuration as follows:
> {code:java}
> linger.ms = 0
> acks = 1
> delivery.timeout.ms = 100
> request.timeout.ms = 80
> sasl.mechanism = "PLAIN"
> security.protocol = "SASL_SSL"
> ..
> {code}
> Because we set acks = 1, the client sends ProduceRequests and receives
> ProduceResponses from brokers. The leader broker doesn't need to wait for the
> ISR replicas to write the data to disk successfully; it can reply to the
> client by sending ProduceResponses directly. In our setup, the ping time
> between the client and the Kafka brokers is ~10ms, and most of the time the
> responses are received about 10ms after the ProduceRequests are sent. But
> sometimes the responses are received ~300ms later.
> The following shows the log from the client.
> {code:java}
> 2021-11-26 02:31:30,567 Sent partial ProduceRequest (v7, 0+16527/37366 bytes, CorrId 2753)
> 2021-11-26 02:31:30,568 Sent partial ProduceRequest (v7, 16527+16384/37366 bytes, CorrId 2753)
> 2021-11-26 02:31:30,568 Sent ProduceRequest (v7, 37366 bytes @ 32911, CorrId 2753)
> 2021-11-26 02:31:30,570 Sent ProduceRequest (v7, 4714 bytes @ 0, CorrId 2754)
> 2021-11-26 02:31:30,571 Sent ProduceRequest (v7, 1161 bytes @ 0, CorrId 2755)
> 2021-11-26 02:31:30,572 Sent ProduceRequest (v7, 1240 bytes @ 0, CorrId 2756)
> 2021-11-26 02:31:30,572 Received ProduceResponse (v7, 69 bytes, CorrId 2751, rtt 9.79ms)
> 2021-11-26 02:31:30,572 Received ProduceResponse (v7, 69 bytes, CorrId 2752, rtt 10.34ms)
> 2021-11-26 02:31:30,573 Received ProduceResponse (v7, 69 bytes, CorrId 2753, rtt 10.11ms)
> 2021-11-26 02:31:30,872 Received ProduceResponse (v7, 69 bytes, CorrId 2754, rtt 309.69ms)
> 2021-11-26 02:31:30,883 Sent ProduceRequest (v7, 1818 bytes @ 0, CorrId 2757)
> 2021-11-26 02:31:30,887 Sent ProduceRequest (v7, 1655 bytes @ 0, CorrId 2758)
> 2021-11-26 02:31:30,888 Received ProduceResponse (v7, 69 bytes, CorrId 2755, rtt 318.85ms)
> 2021-11-26 02:31:30,893 Sent partial ProduceRequest (v7, 0+16527/37562 bytes, CorrId 2759)
> 2021-11-26 02:31:30,894 Sent partial ProduceRequest (v7, 16527+16384/37562 bytes, CorrId 2759)
> 2021-11-26 02:31:30,895 Sent ProduceRequest (v7, 37562 bytes @ 32911, CorrId 2759)
> 2021-11-26 02:31:30,896 Sent ProduceRequest (v7, 4700 bytes @ 0, CorrId 2760)
> 2021-11-26 02:31:30,897 Received ProduceResponse (v7, 69 bytes, CorrId 2756, rtt 317.74ms)
> 2021-11-26 02:31:30,897 Received ProduceResponse (v7, 69 bytes, CorrId 2757, rtt 4.22ms)
> 2021-11-26 02:31:30,899 Received ProduceResponse (v7, 69 bytes, CorrId 2758, rtt 2.61ms){code}
> The requests with CorrId 2753 and 2754 are sent almost at the same time, but
> the response for 2754 is delayed for ~300ms.
> We checked the logs on the broker.
> {code:java}
> [2021-11-26 02:31:30,873] DEBUG Completed request:RequestHeader(apiKey=PRODUCE, apiVersion=7, clientId=rdkafka, correlationId=2754) – {acks=1,timeout=80,numPartitions=1},response: {responses=[{topic=***,partition_responses=[{partition=32,error_code=0,base_offset=58625,log_append_time=-1,log_start_offset=49773}]}],throttle_time_ms=0} from connection 10.10.44.59:9093-10.10.0.68:31183-66;totalTime:0.852,requestQueueTime:0.128,localTime:0.427,remoteTime:0.09,throttleTime:0,responseQueueTime:0.073,sendTime:0.131,securityProtocol:SASL_SSL,principal:User:***,listener:SASL_SSL,clientInformation:ClientInformation(softwareName=confluent-kafka-python, softwareVersion=1.5.0-rdkafka-1.5.2) (kafka.request.logger)
> {code}
> It seems that the time cost on the server side is very small. What's the
> reason for the latency spikes?
> We also did tcpdump at the server side and
[jira] [Resolved] (KAFKA-13539) Improve propagation and processing of SSL handshake failures
[ https://issues.apache.org/jira/browse/KAFKA-13539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajini Sivaram resolved KAFKA-13539.
------------------------------------
Reviewer: Manikumar
Resolution: Fixed

> Improve propagation and processing of SSL handshake failures
> ------------------------------------------------------------
>
> Key: KAFKA-13539
> URL: https://issues.apache.org/jira/browse/KAFKA-13539
> Project: Kafka
> Issue Type: Bug
> Components: security
> Affects Versions: 3.1.0
> Reporter: Rajini Sivaram
> Assignee: Rajini Sivaram
> Priority: Major
> Fix For: 3.2.0
>
> When the server fails an SSL handshake and closes its connection, we attempt
> to report this to clients on a best-effort basis. However, our tests assume
> that the peer always detects the failure. This may not be the case when there
> are delays. It will be good to improve the reliability of handshake failure
> reporting.
[jira] [Created] (KAFKA-13539) Improve propagation and processing of SSL handshake failures
Rajini Sivaram created KAFKA-13539:
--------------------------------------

Summary: Improve propagation and processing of SSL handshake failures
Key: KAFKA-13539
URL: https://issues.apache.org/jira/browse/KAFKA-13539
Project: Kafka
Issue Type: Bug
Components: security
Affects Versions: 3.1.0
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram
Fix For: 3.2.0

When the server fails an SSL handshake and closes its connection, we attempt to report this to clients on a best-effort basis. However, our tests assume that the peer always detects the failure. This may not be the case when there are delays. It will be good to improve the reliability of handshake failure reporting.
[jira] [Resolved] (KAFKA-13461) KafkaController stops functioning as active controller after ZooKeeperClient auth failure
[ https://issues.apache.org/jira/browse/KAFKA-13461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajini Sivaram resolved KAFKA-13461.
------------------------------------
Fix Version/s: 3.1.0, 3.0.1
Reviewer: Jun Rao
Resolution: Fixed

> KafkaController stops functioning as active controller after ZooKeeperClient
> auth failure
> -----------------------------------------------------------------------------
>
> Key: KAFKA-13461
> URL: https://issues.apache.org/jira/browse/KAFKA-13461
> Project: Kafka
> Issue Type: Bug
> Components: zkclient
> Reporter: Vincent Jiang
> Assignee: Rajini Sivaram
> Priority: Major
> Fix For: 3.1.0, 3.0.1
>
> When java.security.auth.login.config is present but there is no "Client"
> section, ZookeeperSaslClient creation fails and raises a LoginException,
> resulting in the warning log:
> {code:java}
> WARN SASL configuration failed: javax.security.auth.login.LoginException: No
> JAAS configuration section named 'Client' was found in specified JAAS
> configuration file: '***'. Will continue connection to Zookeeper server
> without SASL authentication, if Zookeeper server allows it.{code}
> When this happens after initial startup, ClientCnxn enqueues an AuthFailed
> event, which triggers the following sequence:
> # zkclient reinitialization is triggered
> # the controller resigns
> # before the controller's ZK session expires, the controller successfully
> reconnects to ZK and maintains the current session
> # in KafkaController.elect(), the controller sets activeControllerId to
> itself and short-circuits the rest of the election. Since the controller
> resigned earlier and also skips the call to onControllerFailover(), it is not
> actually functioning as the active controller (e.g. the necessary ZK watchers
> haven't been registered).
[jira] [Commented] (KAFKA-13405) Kafka Streams restore-consumer fails to refresh broker IPs after upgrading Kafka cluster
[ https://issues.apache.org/jira/browse/KAFKA-13405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17446375#comment-17446375 ] Rajini Sivaram commented on KAFKA-13405: I am not familiar with the specific restore consumer referred to in the ticket, but Kafka producers and consumers resolve bootstrap servers when they start up. All the matching addresses are cached, so some changes are OK. Clients that are connected to brokers receive metadata from brokers and will continue to work through a rolling upgrade that changes all IPs, since those addresses get re-resolved. But if a client started up with bootstrap servers that resolved to old IPs and attempts to get metadata using bootstrap servers after all IPs have changed, that wouldn't work because we don't resolve bootstrap servers again. It looks like that is the scenario with the restore consumer. > Kafka Streams restore-consumer fails to refresh broker IPs after upgrading > Kafka cluster > > > Key: KAFKA-13405 > URL: https://issues.apache.org/jira/browse/KAFKA-13405 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.8.0, 2.7.1 >Reporter: Daniel O'Halloran >Priority: Critical > Attachments: KafkaConfig.txt, KafkaLogs.txt > > > *+Description+* > After upgrading our Kafka clusters from 2.7 to 2.8, the Streams > restore-consumers never update their broker IPs. > The applications continue to process events as normal until there is a > rebalance. > Once a rebalance occurs, the restore consumers attempt to connect to the old > broker IPs indefinitely and the streams tasks never go back into a RUNNING > state. 
> We were able to replicate this behaviour with kafka-streams client libraries > 2.5.1, 2.7.1 and 2.8.0 > > *+Steps to reproduce+* > # Upgrade brokers from Kafka 2.7 to Kafka 2.8 > # Ensure old brokers are completely shut down > # Trigger a rebalance of a streams application > > *+Expected result+* > * Application rebalances as normal > > *+Actual Result+* > * Application cannot restore its data > * restore consumer tries to connect to old brokers indefinitely > > *+Observations+* > * The cluster metadata was updated on all stream consumer threads during the > broker upgrade (multiple times) as the new brokers were brought online > (corresponding to leader election occurring on the subscribed partitions), > however no cluster metadata was logged on the restore-consumer thread. > * None of the original broker IPs are valid/accessible after the upgrade (as > expected) > * No partition rebalance occurs during the kafka upgrade process. > * When the first re-balance was triggered after upgrade, the > restore-consumer loops failing to connect on each of the 3 original IPs, but > none of the new broker IPs. > > *+Workaround+* > * Restart our applications after upgrading our Kafka cluster -- This message was sent by Atlassian Jira (v8.20.1#820001)
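The resolve-once behaviour described in that comment can be sketched as a toy model (a simplification for illustration, not the actual Kafka client networking code; the `BootstrapCache` class and its method names are hypothetical):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of startup-time bootstrap resolution: host names are
// resolved once and the resulting addresses cached for the lifetime of the
// client, so if every broker IP later changes, the cached list goes stale
// and a metadata request via bootstrap can no longer succeed.
public class BootstrapCache {
    private final List<String> cached = new ArrayList<>();

    // Resolve each bootstrap host once; unresolvable hosts are skipped here
    // (a real client would instead fail fast with a config error).
    public void resolveOnce(List<String> bootstrapHosts) {
        for (String host : bootstrapHosts) {
            try {
                for (InetAddress addr : InetAddress.getAllByName(host)) {
                    cached.add(addr.getHostAddress()); // cached, never refreshed
                }
            } catch (UnknownHostException e) {
                // skipped in this sketch
            }
        }
    }

    public List<String> cachedAddresses() {
        return new ArrayList<>(cached);
    }
}
```

A client built this way keeps serving from `cachedAddresses()` even after every broker has moved, which matches the stuck restore-consumer symptom; restarting the client (the reported workaround) simply re-runs the resolution.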
[jira] [Commented] (KAFKA-13422) Even if the correct username and password are configured, when ClientBroker or KafkaClient tries to establish a SASL connection to ServerBroker, an exception is thrown
[ https://issues.apache.org/jira/browse/KAFKA-13422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444517#comment-17444517 ] Rajini Sivaram commented on KAFKA-13422: Yes, I remember now, we do take the first entry from Subject in the default callback handlers, so the current behaviour isn't great. But I am still unconvinced about removing support for a case that has always existed and can be made to work as required when implementing custom login modules and callback handlers. When we added `sasl.jaas.config` later, we chose to restrict to one login module to avoid the issues that existed with the JAAS file implementation. And now that we recommend the properties option anyway, I am not sure of the value of failing early in the JAAS file case. Perhaps logging an error instead of failing altogether would be better? [~ijuma] What do you think? > Even if the correct username and password are configured, when ClientBroker > or KafkaClient tries to establish a SASL connection to ServerBroker, an > exception is thrown: (Authentication failed: Invalid username or password) > -- > > Key: KAFKA-13422 > URL: https://issues.apache.org/jira/browse/KAFKA-13422 > Project: Kafka > Issue Type: Bug > Components: clients, core >Affects Versions: 2.7.1, 3.0.0 >Reporter: RivenSun >Priority: Major > Attachments: CustomerAuthCallbackHandler.java, > LoginContext_login_debug.png, SaslClientCallbackHandler_handle_debug.png > > > > h1. Foreword: > When deploying a Kafka cluster with a higher version (2.7.1), I encountered > an exception of communication identity authentication failure between > brokers. In the current latest version 3.0.0, this problem can also be > reproduced. > h1. Problem recurring: > h2. 1)broker Version is 3.0.0 > h3. 
The content of kafka_server_jaas.conf of each broker is exactly the same, > the content is as follows: > > > {code:java} > KafkaServer { > org.apache.kafka.common.security.plain.PlainLoginModule required > username="admin" > password="kJTVDziatPgjXG82sFHc4O1EIuewmlvS" > user_admin="kJTVDziatPgjXG82sFHc4O1EIuewmlvS" > user_alice="alice"; > org.apache.kafka.common.security.scram.ScramLoginModule required > username="admin_scram" > password="admin_scram_password"; > > }; > {code} > > > h3. broker server.properties: > One of the broker configuration files is provided, and the content of the > configuration files of other brokers is only different from the localPublicIp > of advertised.listeners. > > {code:java} > broker.id=1 > broker.rack=us-east-1a > advertised.listeners=SASL_PLAINTEXT://localPublicIp:9779,SASL_SSL://localPublicIp:9889,INTERNAL_SSL://:9009,PLAIN_PLUGIN_SSL://localPublicIp:9669 > log.dirs=/asyncmq/kafka/data_1,/asyncmq/kafka/data_2 > zookeeper.connect=*** > listeners=SASL_PLAINTEXT://:9779,SASL_SSL://:9889,INTERNAL_SSL://:9009,PLAIN_PLUGIN_SSL://:9669 > listener.security.protocol.map=INTERNAL_SSL:SASL_SSL,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL,PLAIN_PLUGIN_SSL:SASL_SSL > listener.name.plain_plugin_ssl.plain.sasl.server.callback.handler.class=org.apache.kafka.common.security.plain.internals.PlainServerCallbackHandler > #ssl config > ssl.keystore.password=*** > ssl.key.password=*** > ssl.truststore.password=*** > ssl.keystore.location=*** > ssl.truststore.location=*** > ssl.client.auth=none > ssl.endpoint.identification.algorithm= > #broker communicate config > #security.inter.broker.protocol=SASL_PLAINTEXT > inter.broker.listener.name=INTERNAL_SSL > sasl.mechanism.inter.broker.protocol=PLAIN > #sasl authentication config > sasl.kerberos.service.name=kafka > sasl.enabled.mechanisms=PLAIN,SCRAM-SHA-256,SCRAM-SHA-512,GSSAPI > delegation.token.master.key=*** > delegation.token.expiry.time.ms=8640 > 
delegation.token.max.lifetime.ms=31536 > {code} > > > Then start all brokers at the same time. Each broker has actually been > started successfully, but when establishing a connection between the > controller node and all brokers, the identity authentication has always > failed. The connection between brokers cannot be established normally, > causing the entire Kafka cluster to be unable to provide external services. > h3. The server log keeps printing abnormally like crazy: > The real ip sensitive information of the broker in the log, I use ** > instead of here > > {code:java} > [2021-10-29 14:16:19,831] INFO [SocketServer listenerType=ZK_BROKER, > nodeId=3] Started socket server acceptors and processors > (kafka.network.SocketServer) > [2021-10-29 14:16:19,836] INFO Kafka version: 3.0.0 >
[jira] [Commented] (KAFKA-13422) Even if the correct username and password are configured, when ClientBroker or KafkaClient tries to establish a SASL connection to ServerBroker, an exception is thrown
[ https://issues.apache.org/jira/browse/KAFKA-13422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1738#comment-1738 ] Rajini Sivaram commented on KAFKA-13422: [~RivenSun] We should probably have added that check earlier. But changing the behaviour now from picking the first login module to disallowing multiple modules may be a breaking change. We can log an error instead of throwing an exception if that helps.
[jira] [Commented] (KAFKA-13422) Even if the correct username and password are configured, when ClientBroker or KafkaClient tries to establish a SASL connection to ServerBroker, an exception is thrown
[ https://issues.apache.org/jira/browse/KAFKA-13422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1710#comment-1710 ] Rajini Sivaram commented on KAFKA-13422: [~RivenSun] Producers, consumers and other clients should be provided a JAAS configuration with a single login module, because they only ever use a single mechanism. We haven't supported the case where clients can choose between multiple logins, either through the JAAS config file or the broker property. Kafka's SASL integration is highly configurable, so if you do have a use case where you need to store a range of credentials in one JAAS file and have Kafka use one of them, you can write your own login callback handler and configure `sasl.login.callback.handler.class`. You can then introduce your own logic, based on some options, to choose the appropriate credentials.
[jira] [Commented] (KAFKA-13422) Even if the correct username and password are configured, when ClientBroker or KafkaClient tries to establish a SASL connection to ServerBroker, an exception is thrown
[ https://issues.apache.org/jira/browse/KAFKA-13422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444396#comment-17444396 ] Rajini Sivaram commented on KAFKA-13422: [~RivenSun] The recommended way to configure SASL for brokers is to specify `listener.name...sasl.jaas.config` in the broker's server.properties file rather than using a separate JAAS config file. Using multiple login modules for KafkaServer works in some cases but, as you mentioned, it doesn't work when multiple mechanisms with conflicting options are specified for the same listener. That is a known issue that wasn't fixed because we have an alternative that works better with other Kafka features such as password protection. With the broker property `sasl.jaas.config`, you specify the config separately for each listener and SASL mechanism, which avoids these types of conflicts.
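As a concrete illustration of the per-listener recommendation above, a server.properties fragment along these lines splits the reporter's two login modules across per-listener, per-mechanism properties (the listener name and usernames are taken from the report; treat the exact lines as an illustrative sketch rather than a drop-in config):

```properties
# Hypothetical fragment: one sasl.jaas.config entry per listener and
# mechanism, so the PLAIN and SCRAM login modules no longer share a
# single KafkaServer JAAS section.
listener.name.internal_ssl.plain.sasl.jaas.config=\
  org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="admin" password="***" user_admin="***";
listener.name.internal_ssl.scram-sha-256.sasl.jaas.config=\
  org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="admin_scram" password="***";
```

Because each property is scoped to one listener and one mechanism, conflicting options between mechanisms on the same listener cannot collide the way they can in a shared JAAS file section.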
[jira] [Commented] (KAFKA-13293) Support client reload of PEM certificates
[ https://issues.apache.org/jira/browse/KAFKA-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414263#comment-17414263 ] Rajini Sivaram commented on KAFKA-13293: [~teabot] Sorry, I should have said `recreates SSLContext` rather than `SslEngineFactory`. I haven't tried it out, but I think you can implement a custom factory that has a mutable SSLContext which is updated when required. Basically, we want to make sure `CustomSslEngineFactory` always has a valid `SSLContext` that is used by SslEngineFactory#createClientSslEngine() to create a new SSLEngine whenever a new connection is established, but we can swap out the SSLContext for a new one when we detect a change. As you said, we cannot rely on `SslEngineFactory#shouldBeRebuilt()`, since that is invoked only for brokers when ALTER_CONFIGS is processed. > Support client reload of PEM certificates > - > > Key: KAFKA-13293 > URL: https://issues.apache.org/jira/browse/KAFKA-13293 > Project: Kafka > Issue Type: Improvement > Components: clients, security >Affects Versions: 2.7.0, 2.8.0, 2.7.1 >Reporter: Elliot West >Priority: Major > > Since Kafka 2.7.0, clients are able to authenticate using PEM certificates as > client configuration properties, in addition to JKS file-based key stores > (KAFKA-10338). With PEM, certificate chains are passed into clients as simple > string-based key-value properties, alongside existing client configuration. > This offers a number of benefits: it provides a JVM-agnostic security > mechanism from the perspective of clients, removes the client's dependency on > the local filesystem, and allows the encapsulation of the entire client > configuration into a single payload. > However, the current client PEM implementation has a feature regression when > compared with the JKS implementation. With the JKS approach, clients would > automatically reload certificates when the key stores were modified on disk. 
> This enables a seamless approach for the replacement of certificates when > they are due to expire; no further configuration or explicit interference > with the client lifecycle is needed for the client to migrate to renewed > certificates. > Such a capability does not currently exist for PEM. One supplies key chains > when instantiating clients only - there is no mechanism available to either > directly reconfigure the client, or for the client to observe changes to the > original properties set reference used in construction. Additionally, no > work-arounds are documented that might give users alternative strategies for > dealing with expiring certificates. Given that expiration and renewal of > certificates is an industry standard practice, it could be argued that the > current PEM client implementation is not fit for purpose. > In summary, a mechanism should be provided such that clients can > automatically detect, load, and use updated PEM key chains from some non-file > based source (object ref, method invocation, listener, etc.) > Finally, it is suggested that in the short term Kafka documentation be > updated to describe any viable mechanism for updating client PEM certs > (perhaps closing the existing client and then recreating it?). -- This message was sent by Atlassian Jira (v8.3.4#803005)
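The mutable-SSLContext idea from the comment above can be sketched with plain JDK types (a hypothetical simplification; a real plugin would implement Kafka's `SslEngineFactory` interface, which is omitted here to keep the sketch self-contained, and the class and method names are illustrative):

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical holder for a swappable SSLContext, so that new connections
// pick up renewed certificates without restarting the client. Existing
// connections keep the engine they were created with.
public class ReloadableSslContextHolder {
    private final AtomicReference<SSLContext> context = new AtomicReference<>();

    public ReloadableSslContextHolder(SSLContext initial) {
        context.set(initial);
    }

    // Called when renewed PEM certs are detected (watcher, poll, etc.).
    public void swap(SSLContext fresh) {
        context.set(fresh);
    }

    // Analogous to SslEngineFactory#createClientSslEngine: every new
    // connection gets an engine built from the current context.
    public SSLEngine newClientEngine(String host, int port) {
        SSLEngine engine = context.get().createSSLEngine(host, port);
        engine.setUseClientMode(true);
        return engine;
    }

    // Convenience for the sketch: a context using the JVM's default
    // key and trust managers.
    public static SSLContext defaultContext() {
        try {
            SSLContext c = SSLContext.getInstance("TLS");
            c.init(null, null, null);
            return c;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Whatever mechanism detects renewed PEM material (a file watcher, polling an external source) would build a fresh context and call `swap()`; connections established afterwards use it, while in-flight connections are unaffected.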
[jira] [Commented] (KAFKA-13293) Support client reload of PEM certificates
[ https://issues.apache.org/jira/browse/KAFKA-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414217#comment-17414217 ] Rajini Sivaram commented on KAFKA-13293: [~teabot] Yes, agree that it will be useful to support certificate updates without restart for clients. 1) At the moment, as you said, you will need to pull in new certs and restart clients if you are using Kafka without custom plugins. 2) If you are willing to add a custom plugin, you can add a custom implementation of `ssl.engine.factory.class` that recreates the SslEngineFactory when certs change, by watching key/truststore files or pulling in PEM configs from an external source. 3) For the longer term, [https://cwiki.apache.org/confluence/display/KAFKA/KIP-649%3A+Dynamic+Client+Configuration] will be useful for performing dynamic updates for clients, similar to how we do dynamic broker config updates. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12751) ISRs remain in in-flight state if proposed state is same as actual state
[ https://issues.apache.org/jira/browse/KAFKA-12751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414212#comment-17414212 ] Rajini Sivaram commented on KAFKA-12751: The bug only happens in an unusual case where the proposed ISR is the same as the expected ISR, and the chances of hitting that are very low. It is an issue only if `inter.broker.protocol.version >= 2.7`. If it does occur, further ISR updates don't happen, so the broker will need to be restarted. The release plan for 2.8.1 is here: [https://cwiki.apache.org/confluence/display/KAFKA/Release+Plan+2.8.1]; work is under way and the release is expected within the next few weeks. > ISRs remain in in-flight state if proposed state is same as actual state > > > Key: KAFKA-12751 > URL: https://issues.apache.org/jira/browse/KAFKA-12751 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.7.0, 2.8.0, 2.7.1 >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Critical > Fix For: 2.7.2, 2.8.1 > > > If the proposed ISR state in an AlterIsr request is the same as the actual state, > the Controller returns a successful response without performing any updates. But > the broker code that processes the response leaves the ISR state in the in-flight > state without committing it. This prevents further ISR updates until the next > leader election. -- This message was sent by Atlassian Jira (v8.3.4#803005)
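A heavily simplified model of the partition state machine described in this issue may help; `PartitionIsrState` and its methods are hypothetical names, and the response handler below shows the fixed behaviour of always committing and clearing the in-flight state, even when the controller returned the ISR unchanged:

```java
import java.util.List;

// Hypothetical toy model of the bug: the broker previously skipped the
// commit step when the controller's response carried an ISR identical to
// the proposed one, so inflightIsr was never cleared and all further
// AlterIsr updates were blocked until the next leader election.
public class PartitionIsrState {
    private List<Integer> committedIsr;
    private List<Integer> inflightIsr; // non-null while an AlterIsr is pending

    public PartitionIsrState(List<Integer> isr) {
        this.committedIsr = isr;
    }

    public void proposeIsr(List<Integer> proposed) {
        this.inflightIsr = proposed;
    }

    // Fixed handling: commit and clear the in-flight state on every
    // successful response, regardless of whether the ISR changed.
    public void handleSuccessfulAlterIsrResponse(List<Integer> controllerIsr) {
        this.committedIsr = controllerIsr;
        this.inflightIsr = null;
    }

    public boolean hasInflight() { return inflightIsr != null; }
    public List<Integer> committedIsr() { return committedIsr; }
}
```

In this model, proposing an ISR equal to the committed one and then processing a successful response still clears the pending state, which is the invariant the fix restores.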
[jira] [Commented] (KAFKA-13293) Support client reload of PEM certificates
[ https://issues.apache.org/jira/browse/KAFKA-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414137#comment-17414137 ] Rajini Sivaram commented on KAFKA-13293: [~teabot] Which client (and configs) do you use to dynamically reload JKS keystores in clients today? As far as I understand, default implementation in Apache Kafka supports reloading of keystores and truststores from files only for brokers. Clients can override `ssl.engine.factory.class`, but not sure if that is what you were referring to for dynamic reloading in clients. > Support client reload of PEM certificates > - > > Key: KAFKA-13293 > URL: https://issues.apache.org/jira/browse/KAFKA-13293 > Project: Kafka > Issue Type: Improvement > Components: clients, security >Affects Versions: 2.7.0, 2.8.0, 2.7.1 >Reporter: Elliot West >Priority: Major > > Since Kafka 2.7.0, clients are able to authenticate using PEM certificates as > client configuration properties in addition to JKS file based key stores > (KAFKA-10338). With PEM, certificate chains are passed into clients as simple > string based key-value properties, alongside existing client configuration. > This offers a number of benefits: it provides a JVM agnostic security > mechanism from the perspective of clients, removes the client's dependency on > the local filesystem, and allows the the encapsulation of the entire client > configuration into a single payload. > However, the current client PEM implement has a feature regression when > compared with the JKS implementation. With the JKS approach, clients would > automatically reload certificates when the key stores were modified on disk. > This enables a seamless approach for the replacement of certificates when > they are due to expire; no further configuration or explicit interference > with the client lifecycle is needed for the client to migrate to renewed > certificates. > Such a capability does not currently exist for PEM. 
One supplies key chains > when instantiating clients only - there is no mechanism available to either > directly reconfigure the client, or for the client to observe changes to the > original properties set reference used in construction. Additionally, no > work-arounds are documented that might give users alternative strategies for > dealing with expiring certificates. Given that expiration and renewal of > certificates is an industry standard practice, it could be argued that the > current PEM client implementation is not fit for purpose. > In summary, a mechanism should be provided such that clients can > automatically detect, load, and use updated PEM key chains from some non-file > based source (object ref, method invocation, listener, etc.) > Finally, it is suggested that in the short-term Kafka documentation be > updated to describe any viable mechanism for updating client PEM certs > (perhaps closing the existing client and then recreating it?). -- This message was sent by Atlassian Jira (v8.3.4#803005)
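The "close and recreate" workaround suggested at the end of the ticket can be sketched without any Kafka dependencies. This is a hypothetical helper, not a Kafka API: the PEM material is kept in a file that is polled for changes, and `clientFactory` stands in for "build a new producer/consumer from PEM config strings"; the client type is abstracted as `AutoCloseable`.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.function.Function;

/**
 * Hypothetical workaround sketch (not a Kafka API): keep the PEM material in a
 * file, poll it for changes, and close/recreate the client when it changes.
 */
final class PemReloadingClientHolder<C extends AutoCloseable> {
    private final Path pemPath;
    private final Function<String, C> clientFactory;
    private byte[] lastDigest;
    private C client;

    PemReloadingClientHolder(Path pemPath, Function<String, C> clientFactory) throws Exception {
        this.pemPath = pemPath;
        this.clientFactory = clientFactory;
        reloadIfChanged();
    }

    /** Re-read the PEM file; if its content changed, close the old client and build a new one. */
    synchronized C reloadIfChanged() throws Exception {
        byte[] pem = Files.readAllBytes(pemPath);
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(pem);
        if (lastDigest == null || !MessageDigest.isEqual(lastDigest, digest)) {
            if (client != null) client.close();   // close before recreating, as the ticket suggests
            client = clientFactory.apply(new String(pem));
            lastDigest = digest;
        }
        return client;
    }
}
```

A scheduled executor calling `reloadIfChanged()` periodically would complete the picture; the polling interval is a deployment choice.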
[jira] [Comment Edited] (KAFKA-10338) Support PEM format for SSL certificates and private key
[ https://issues.apache.org/jira/browse/KAFKA-10338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413395#comment-17413395 ] Rajini Sivaram edited comment on KAFKA-10338 at 9/10/21, 9:27 PM: -- [~teabot] We currently don't have a way of reconfiguring PEM configs for clients unless they are stored externally in a file and the file is reloaded. It may be possible to add a custom `ssl.engine.factory.class` that does reconfiguration for clients. For brokers, we can use standard dynamic broker configs for PEM. was (Author: rsivaram): [~teabot] We currently don't have a way of updating PEM configs for clients unless they are stored externally in a file and the file is reloaded. It may be possible to add a custom `ssl.engine.factory.class` that does reconfiguration for clients. For brokers, we can use standard dynamic broker configs for PEM. > Support PEM format for SSL certificates and private key > --- > > Key: KAFKA-10338 > URL: https://issues.apache.org/jira/browse/KAFKA-10338 > Project: Kafka > Issue Type: New Feature > Components: security >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > Fix For: 2.7.0 > > > We currently support only file-based JKS/PKCS12 format for SSL key stores and > trust stores. It will be good to add support for PEM as configuration values > that fits better with config externalization. > KIP: > https://cwiki.apache.org/confluence/display/KAFKA/KIP-651+-+Support+PEM+format+for+SSL+certificates+and+private+key -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-10338) Support PEM format for SSL certificates and private key
[ https://issues.apache.org/jira/browse/KAFKA-10338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413395#comment-17413395 ] Rajini Sivaram commented on KAFKA-10338: [~teabot] We currently don't have a way of updating PEM configs for clients unless they are stored externally in a file and the file is reloaded. It may be possible to add a custom `ssl.engine.factory.class` that does reconfiguration for clients. For brokers, we can use standard dynamic broker configs for PEM. > Support PEM format for SSL certificates and private key > --- > > Key: KAFKA-10338 > URL: https://issues.apache.org/jira/browse/KAFKA-10338 > Project: Kafka > Issue Type: New Feature > Components: security >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > Fix For: 2.7.0 > > > We currently support only file-based JKS/PKCS12 format for SSL key stores and > trust stores. It will be good to add support for PEM as configuration values > that fits better with config externalization. > KIP: > https://cwiki.apache.org/confluence/display/KAFKA/KIP-651+-+Support+PEM+format+for+SSL+certificates+and+private+key -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-13277) Serialization of long tagged string in request/response throws BufferOverflowException
[ https://issues.apache.org/jira/browse/KAFKA-13277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram updated KAFKA-13277: --- Fix Version/s: 2.7.2 > Serialization of long tagged string in request/response throws > BufferOverflowException > -- > > Key: KAFKA-13277 > URL: https://issues.apache.org/jira/browse/KAFKA-13277 > Project: Kafka > Issue Type: Bug > Components: clients >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Blocker > Fix For: 3.0.0, 2.7.2, 2.8.1 > > > Size computation for tagged strings in the message generator is incorrect and > hence it works only for small strings (126 bytes or so) where the length > happens to be correct. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-13277) Serialization of long tagged string in request/response throws BufferOverflowException
[ https://issues.apache.org/jira/browse/KAFKA-13277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram updated KAFKA-13277: --- Affects Version/s: 2.4.1 2.5.1 2.8.0 2.7.1 2.6.2 > Serialization of long tagged string in request/response throws > BufferOverflowException > -- > > Key: KAFKA-13277 > URL: https://issues.apache.org/jira/browse/KAFKA-13277 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 2.4.1, 2.5.1, 2.8.0, 2.7.1, 2.6.2 >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Blocker > Fix For: 3.0.0, 2.7.2, 2.8.1 > > > Size computation for tagged strings in the message generator is incorrect and > hence it works only for small strings (126 bytes or so) where the length > happens to be correct. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-13277) Serialization of long tagged string in request/response throws BufferOverflowException
[ https://issues.apache.org/jira/browse/KAFKA-13277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram updated KAFKA-13277: --- Fix Version/s: 2.8.1 > Serialization of long tagged string in request/response throws > BufferOverflowException > -- > > Key: KAFKA-13277 > URL: https://issues.apache.org/jira/browse/KAFKA-13277 > Project: Kafka > Issue Type: Bug > Components: clients >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Blocker > Fix For: 3.0.0, 2.8.1 > > > Size computation for tagged strings in the message generator is incorrect and > hence it works only for small strings (126 bytes or so) where the length > happens to be correct. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-13277) Serialization of long tagged string in request/response throws BufferOverflowException
[ https://issues.apache.org/jira/browse/KAFKA-13277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram updated KAFKA-13277: --- Fix Version/s: (was: 3.0.1) 3.0.0 > Serialization of long tagged string in request/response throws > BufferOverflowException > -- > > Key: KAFKA-13277 > URL: https://issues.apache.org/jira/browse/KAFKA-13277 > Project: Kafka > Issue Type: Bug > Components: clients >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > Fix For: 3.0.0 > > > Size computation for tagged strings in the message generator is incorrect and > hence it works only for small strings (126 bytes or so) where the length > happens to be correct. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-13277) Serialization of long tagged string in request/response throws BufferOverflowException
[ https://issues.apache.org/jira/browse/KAFKA-13277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram updated KAFKA-13277: --- Priority: Blocker (was: Major) > Serialization of long tagged string in request/response throws > BufferOverflowException > -- > > Key: KAFKA-13277 > URL: https://issues.apache.org/jira/browse/KAFKA-13277 > Project: Kafka > Issue Type: Bug > Components: clients >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Blocker > Fix For: 3.0.0 > > > Size computation for tagged strings in the message generator is incorrect and > hence it works only for small strings (126 bytes or so) where the length > happens to be correct. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-13277) Serialization of long tagged string in request/response throws BufferOverflowException
[ https://issues.apache.org/jira/browse/KAFKA-13277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411311#comment-17411311 ] Rajini Sivaram commented on KAFKA-13277: [~ijuma] Yes, it will be good to include in 3.0. > Serialization of long tagged string in request/response throws > BufferOverflowException > -- > > Key: KAFKA-13277 > URL: https://issues.apache.org/jira/browse/KAFKA-13277 > Project: Kafka > Issue Type: Bug > Components: clients >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > Fix For: 3.0.1 > > > Size computation for tagged strings in the message generator is incorrect and > hence it works only for small strings (126 bytes or so) where the length > happens to be correct. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-13277) Serialization of long tagged string in request/response throws BufferOverflowException
Rajini Sivaram created KAFKA-13277: -- Summary: Serialization of long tagged string in request/response throws BufferOverflowException Key: KAFKA-13277 URL: https://issues.apache.org/jira/browse/KAFKA-13277 Project: Kafka Issue Type: Bug Components: clients Reporter: Rajini Sivaram Assignee: Rajini Sivaram Fix For: 3.0.1 Size computation for tagged strings in the message generator is incorrect and hence it works only for small strings (126 bytes or so) where the length happens to be correct. -- This message was sent by Atlassian Jira (v8.3.4#803005)
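The failure mode described in KAFKA-13277 can be illustrated with a simplified compact-string writer (this is not the actual generated Kafka code, just a sketch of the same arithmetic): compact strings carry their length + 1 as an unsigned varint, so any size estimate that assumes a one-byte length prefix is only correct up to 126-byte strings, and a buffer allocated from the buggy estimate overflows for longer ones.

```java
import java.nio.ByteBuffer;

/** Sketch of the size-computation bug: varint length prefixes vs a fixed 1-byte assumption. */
final class TaggedStringSizeDemo {
    /** Number of bytes an unsigned varint occupies (7 payload bits per byte). */
    static int sizeOfUnsignedVarint(int value) {
        int bytes = 1;
        while ((value & 0xFFFFFF80) != 0) {
            bytes++;
            value >>>= 7;
        }
        return bytes;
    }

    /** Buggy estimate: assumes the length prefix always fits in one byte. */
    static int buggySize(byte[] utf8) {
        return 1 + utf8.length;
    }

    /** Correct estimate: varint(length + 1) + payload bytes. */
    static int correctSize(byte[] utf8) {
        return sizeOfUnsignedVarint(utf8.length + 1) + utf8.length;
    }

    /** Writes a compact string: unsigned varint of (length + 1), then the bytes. */
    static void writeCompactString(ByteBuffer buf, byte[] utf8) {
        int v = utf8.length + 1;
        while ((v & 0xFFFFFF80) != 0) {
            buf.put((byte) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        buf.put((byte) v);
        buf.put(utf8);
    }
}
```

For a 100-byte string both estimates agree (length + 1 = 101 fits in one varint byte); for a 200-byte string the varint needs two bytes, the buggy estimate is one byte short, and `ByteBuffer.put` throws `BufferOverflowException` - matching the "126 bytes or so" boundary in the ticket.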
[jira] [Resolved] (KAFKA-10774) Support Describe topic using topic IDs
[ https://issues.apache.org/jira/browse/KAFKA-10774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-10774. Fix Version/s: 3.1.0 Reviewer: Rajini Sivaram Resolution: Fixed > Support Describe topic using topic IDs > -- > > Key: KAFKA-10774 > URL: https://issues.apache.org/jira/browse/KAFKA-10774 > Project: Kafka > Issue Type: Sub-task >Reporter: dengziming >Assignee: dengziming >Priority: Major > Fix For: 3.1.0 > > > Similar to KAFKA-10547, which adds topic IDs in MetadataResponse, we add topic IDs > to MetadataRequest so that topic descriptions can be fetched by topic ID -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-13207) Replica fetcher should not update partition state on diverging epoch if partition removed from fetcher
[ https://issues.apache.org/jira/browse/KAFKA-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-13207. Resolution: Fixed > Replica fetcher should not update partition state on diverging epoch if > partition removed from fetcher > -- > > Key: KAFKA-13207 > URL: https://issues.apache.org/jira/browse/KAFKA-13207 > Project: Kafka > Issue Type: New Feature > Components: core >Affects Versions: 2.8.0 >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Critical > Fix For: 3.0.0, 2.8.1 > > > {{AbstractFetcherThread#truncateOnFetchResponse}}{color:#24292e} is used with > IBP 2.7 and above to truncate partitions based on diverging epoch returned in > fetch responses. Truncation should only be performed for partitions that are > still owned by the fetcher and this check should be done while holding > {color}{{partitionMapLock}}{color:#24292e} to ensure that any partitions > removed from the fetcher thread are not truncated{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-13207) Replica fetcher should not update partition state on diverging epoch if partition removed from fetcher
Rajini Sivaram created KAFKA-13207: -- Summary: Replica fetcher should not update partition state on diverging epoch if partition removed from fetcher Key: KAFKA-13207 URL: https://issues.apache.org/jira/browse/KAFKA-13207 Project: Kafka Issue Type: New Feature Components: core Affects Versions: 2.8.0 Reporter: Rajini Sivaram Assignee: Rajini Sivaram Fix For: 3.0.0, 2.8.1 {{AbstractFetcherThread#truncateOnFetchResponse}}{color:#24292e} is used with IBP 2.7 and above to truncate partitions based on diverging epoch returned in fetch responses. Truncation should only be performed for partitions that are still owned by the fetcher and this check should be done while holding {color}{{partitionMapLock}}{color:#24292e} to ensure that any partitions removed from the fetcher thread are not truncated{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
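The fix described in KAFKA-13207 is a classic check-then-act pattern done under a lock. The sketch below is a simplification (not the actual `AbstractFetcherThread` code): the ownership check and the state update happen while holding the same `partitionMapLock`, so a partition removed from the fetcher can never be truncated afterwards.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

/** Simplified fetcher partition state with the ownership check done under the lock. */
final class FetcherPartitionStates {
    private final ReentrantLock partitionMapLock = new ReentrantLock();
    private final Map<String, Long> fetchOffsets = new HashMap<>(); // partition -> fetch offset

    void addPartition(String tp, long fetchOffset) {
        partitionMapLock.lock();
        try { fetchOffsets.put(tp, fetchOffset); } finally { partitionMapLock.unlock(); }
    }

    void removePartition(String tp) {
        partitionMapLock.lock();
        try { fetchOffsets.remove(tp); } finally { partitionMapLock.unlock(); }
    }

    /** Returns false (and does nothing) if the partition is no longer owned by this fetcher. */
    boolean truncateOnDivergence(String tp, long divergentEndOffset) {
        partitionMapLock.lock();
        try {
            Long current = fetchOffsets.get(tp);
            if (current == null) return false;                   // removed concurrently: skip
            fetchOffsets.put(tp, Math.min(current, divergentEndOffset));
            return true;
        } finally {
            partitionMapLock.unlock();
        }
    }

    Long fetchOffset(String tp) { return fetchOffsets.get(tp); }
}
```

Checking ownership outside the lock and truncating inside it would reintroduce the race: the partition could be removed between the two steps.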
[jira] [Resolved] (KAFKA-13141) Leader should not update follower fetch offset if diverging epoch is present
[ https://issues.apache.org/jira/browse/KAFKA-13141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-13141. Reviewer: Jason Gustafson Resolution: Fixed > Leader should not update follower fetch offset if diverging epoch is present > > > Key: KAFKA-13141 > URL: https://issues.apache.org/jira/browse/KAFKA-13141 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.8.0, 2.7.1 >Reporter: Jason Gustafson >Assignee: Rajini Sivaram >Priority: Blocker > Fix For: 3.0.0, 2.7.2, 2.8.1 > > > In 2.7, we began doing fetcher truncation piggybacked on the Fetch protocol > instead of using the old OffsetsForLeaderEpoch API. When truncation is > detected, we return a `divergingEpoch` field in the Fetch response, but we do > not set an error code. The sender is expected to check if the diverging epoch > is present and truncate accordingly. > All of this works correctly in the fetcher implementation, but the problem is > that the logic to update the follower fetch position on the leader does not > take into account the diverging epoch present in the response. This means the > fetch offsets can be updated incorrectly, which can lead to either log > divergence or the loss of committed data. > For example, we hit the following case with 3 replicas. Leader 1 is elected > in epoch 1 with an end offset of 100. The followers are at offset 101 > Broker 1: (Leader) Epoch 1 from offset 100 > Broker 2: (Follower) Epoch 1 from offset 101 > Broker 3: (Follower) Epoch 1 from offset 101 > Broker 1 receives fetches from 2 and 3 at offset 101. The leader detects the > divergence and returns a diverging epoch in the fetch state. Nevertheless, > the fetch positions for both followers are updated to 101 and the high > watermark is advanced. > After brokers 2 and 3 had truncated to offset 100, broker 1 experienced a > network partition of some kind and was kicked from the ISR. 
This caused > broker 2 to get elected, which resulted in the following state at the start > of epoch 2. > Broker 1: (Follower) Epoch 2 from offset 101 > Broker 2: (Leader) Epoch 2 from offset 100 > Broker 3: (Follower) Epoch 2 from offset 100 > Broker 2 was then able to write a new entry at offset 100 and the old record > which may have been exposed to consumers was deleted by broker 1. -- This message was sent by Atlassian Jira (v8.3.4#803005)
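The scenario above boils down to one rule on the leader: do not advance a follower's fetch state when the fetch response carries a diverging epoch. A toy model of that rule (names hypothetical, heavily simplified from the real replica tracking code):

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy model of the KAFKA-13141 fix: a follower's fetch offset (and hence the
 * high watermark) advances only when its fetch response has no diverging epoch.
 */
final class LeaderReplicaTracker {
    private final Map<Integer, Long> followerFetchOffsets = new HashMap<>();

    void recordFetch(int brokerId, long fetchOffset, boolean divergingEpochInResponse) {
        if (divergingEpochInResponse) return;  // follower must truncate first; don't advance
        followerFetchOffsets.put(brokerId, fetchOffset);
    }

    /** High watermark sketch: minimum of the leader end offset and all follower fetch offsets. */
    long highWatermark(long leaderEndOffset) {
        long hwm = leaderEndOffset;
        for (long offset : followerFetchOffsets.values()) hwm = Math.min(hwm, offset);
        return hwm;
    }
}
```

In the incident described above, the buggy version advanced both followers to 101 despite the diverging epoch, letting the high watermark pass offset 100 before the followers truncated it away.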
[jira] [Resolved] (KAFKA-13026) Idempotent producer (KAFKA-10619) follow-up testings
[ https://issues.apache.org/jira/browse/KAFKA-13026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-13026. Fix Version/s: 3.0.0 Reviewer: Rajini Sivaram Resolution: Fixed > Idempotent producer (KAFKA-10619) follow-up testings > > > Key: KAFKA-13026 > URL: https://issues.apache.org/jira/browse/KAFKA-13026 > Project: Kafka > Issue Type: Improvement >Reporter: Cheng Tan >Assignee: Cheng Tan >Priority: Major > Fix For: 3.0.0 > > > # Adjust config priority > # Adjust the JUnit tests so we get good coverage of the non-default behavior > # Similar to 2 for system tests -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-13026) Idempotent producer (KAFKA-10619) follow-up testings
[ https://issues.apache.org/jira/browse/KAFKA-13026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram reassigned KAFKA-13026: -- Assignee: Cheng Tan > Idempotent producer (KAFKA-10619) follow-up testings > > > Key: KAFKA-13026 > URL: https://issues.apache.org/jira/browse/KAFKA-13026 > Project: Kafka > Issue Type: Improvement >Reporter: Cheng Tan >Assignee: Cheng Tan >Priority: Major > > # Adjust config priority > # Adjust the JUnit tests so we get good coverage of the non-default behavior > # Similar to 2 for system tests -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-13043) Add Admin API for batched offset fetch requests (KIP-709)
Rajini Sivaram created KAFKA-13043: -- Summary: Add Admin API for batched offset fetch requests (KIP-709) Key: KAFKA-13043 URL: https://issues.apache.org/jira/browse/KAFKA-13043 Project: Kafka Issue Type: New Feature Components: admin Affects Versions: 3.0.0 Reporter: Rajini Sivaram Assignee: Sanjana Kaundinya Fix For: 3.1.0 Protocol changes and broker-side changes to process batched OffsetFetchRequests were added under KAFKA-12234. This ticket is to add Admin API changes to use this feature. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-12234) Extend OffsetFetch requests to accept multiple group ids.
[ https://issues.apache.org/jira/browse/KAFKA-12234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-12234. Reviewer: Rajini Sivaram Resolution: Fixed > Extend OffsetFetch requests to accept multiple group ids. > - > > Key: KAFKA-12234 > URL: https://issues.apache.org/jira/browse/KAFKA-12234 > Project: Kafka > Issue Type: Improvement >Affects Versions: 2.8.0 >Reporter: Tom Scott >Assignee: Sanjana Kaundinya >Priority: Minor > Labels: needs-kip > Fix For: 3.0.0 > > > More details are in the KIP: > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=173084258 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-13029) FindCoordinators batching can break consumers during rolling upgrade
[ https://issues.apache.org/jira/browse/KAFKA-13029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-13029. Resolution: Fixed > FindCoordinators batching can break consumers during rolling upgrade > > > Key: KAFKA-13029 > URL: https://issues.apache.org/jira/browse/KAFKA-13029 > Project: Kafka > Issue Type: Bug > Components: consumer >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Blocker > Fix For: 3.0.0 > > > The changes made under KIP-699 assume that it is always safe to use unbatched > mode since we move from batched to unbatched and cache the value forever in > clients if a broker doesn't support batching. During rolling upgrade, if a > request is sent to an older broker, we move from batched to unbatched mode. > The consumer (admin client as well I think) disables batching and future > requests to upgraded brokers will fail because we attempt to use unbatched > requests with a newer version of the request. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-13029) FindCoordinators batching can break consumers during rolling upgrade
[ https://issues.apache.org/jira/browse/KAFKA-13029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17373991#comment-17373991 ] Rajini Sivaram commented on KAFKA-13029: [~mimaison] [~dajac] The consumer and transaction coordinator always look up a single group, so the batch flag and the fallback logic in those paths are unnecessary. As David mentioned, we just need to set the appropriate data based on versions. The plan is to do the same for batching OffsetFetchRequest as well. For the Admin Client, the PR currently switches to unbatched mode if there is an error. But unbatched mode now works with any version (by setting the appropriate data). So while performance could be improved, in the case where an admin client gets into this state, by using the version of individual brokers, it didn't seem worthwhile making that change now for admin clients. > FindCoordinators batching can break consumers during rolling upgrade > > > Key: KAFKA-13029 > URL: https://issues.apache.org/jira/browse/KAFKA-13029 > Project: Kafka > Issue Type: Bug > Components: consumer >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Blocker > Fix For: 3.0.0 > > > The changes made under KIP-699 assume that it is always safe to use unbatched > mode since we move from batched to unbatched and cache the value forever in > clients if a broker doesn't support batching. During rolling upgrade, if a > request is sent to an older broker, we move from batched to unbatched mode. > The consumer (admin client as well I think) disables batching and future > requests to upgraded brokers will fail because we attempt to use unbatched > requests with a newer version of the request. -- This message was sent by Atlassian Jira (v8.3.4#803005)
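The "set the appropriate data based on versions" approach can be sketched as a per-request decision rather than a sticky client-wide flag. This is an illustrative planner, not the actual client code, and the version number is an assumption (KIP-699 introduced batched FindCoordinator in v4):

```java
import java.util.List;
import java.util.stream.Collectors;

/**
 * Sketch: decide batched vs. unbatched FindCoordinator lookups per request,
 * based on the API version the target broker supports, instead of caching a
 * client-wide "unbatched" flag forever (the KAFKA-13029 bug).
 */
final class CoordinatorLookupPlanner {
    static final short MIN_BATCHED_VERSION = 4; // assumed: batching added in v4 (KIP-699)

    /** One batched request for new brokers; one request per group id for old ones. */
    static List<List<String>> planRequests(List<String> groupIds, short apiVersion) {
        if (apiVersion >= MIN_BATCHED_VERSION) {
            return List.of(List.copyOf(groupIds));
        }
        return groupIds.stream().map(List::of).collect(Collectors.toList());
    }
}
```

Because the decision is made per broker and per request, talking to one old broker during a rolling upgrade no longer poisons requests sent to already-upgraded brokers.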
[jira] [Created] (KAFKA-13029) FindCoordinators batching can break consumers during rolling upgrade
Rajini Sivaram created KAFKA-13029: -- Summary: FindCoordinators batching can break consumers during rolling upgrade Key: KAFKA-13029 URL: https://issues.apache.org/jira/browse/KAFKA-13029 Project: Kafka Issue Type: Bug Components: consumer Reporter: Rajini Sivaram Assignee: Rajini Sivaram Fix For: 3.0.0 The changes made under KIP-699 assume that it is always safe to use unbatched mode since we move from batched to unbatched and cache the value forever in clients if a broker doesn't support batching. During rolling upgrade, if a request is sent to an older broker, we move from batched to unbatched mode. The consumer (admin client as well I think) disables batching and future requests to upgraded brokers will fail because we attempt to use unbatched requests with a newer version of the request. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-12996) OffsetOutOfRange not handled correctly for diverging epochs when fetch offset less than leader start offset
[ https://issues.apache.org/jira/browse/KAFKA-12996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram updated KAFKA-12996: --- Fix Version/s: 2.7.2 Affects Version/s: 2.7.1 > OffsetOutOfRange not handled correctly for diverging epochs when fetch offset > less than leader start offset > --- > > Key: KAFKA-12996 > URL: https://issues.apache.org/jira/browse/KAFKA-12996 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.8.0, 2.7.1 >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > Fix For: 3.0.0, 2.7.2, 2.8.1 > > > {color:#24292e}If fetchOffset < startOffset, we currently throw > OffsetOutOfRangeException when attempting to read from the log in the regular > case. But for diverging epochs, we return Errors.NONE with the new leader > start offset, hwm etc.. ReplicaFetcherThread throws OffsetOutOfRangeException > when processing responses with Errors.NONE if the leader's offsets in the > response are out of range and this moves the partition to failed state. We > should add a check for this case when processing fetch requests and ensure > OffsetOutOfRangeException is thrown regardless of epochs.{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-12996) OffsetOutOfRange not handled correctly for diverging epochs when fetch offset less than leader start offset
[ https://issues.apache.org/jira/browse/KAFKA-12996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-12996. Reviewer: Guozhang Wang Resolution: Fixed > OffsetOutOfRange not handled correctly for diverging epochs when fetch offset > less than leader start offset > --- > > Key: KAFKA-12996 > URL: https://issues.apache.org/jira/browse/KAFKA-12996 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.8.0, 2.7.1 >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > Fix For: 3.0.0, 2.7.2, 2.8.1 > > > {color:#24292e}If fetchOffset < startOffset, we currently throw > OffsetOutOfRangeException when attempting to read from the log in the regular > case. But for diverging epochs, we return Errors.NONE with the new leader > start offset, hwm etc.. ReplicaFetcherThread throws OffsetOutOfRangeException > when processing responses with Errors.NONE if the leader's offsets in the > response are out of range and this moves the partition to failed state. We > should add a check for this case when processing fetch requests and ensure > OffsetOutOfRangeException is thrown regardless of epochs.{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12996) OffsetOutOfRange not handled correctly for diverging epochs when fetch offset less than leader start offset
Rajini Sivaram created KAFKA-12996: -- Summary: OffsetOutOfRange not handled correctly for diverging epochs when fetch offset less than leader start offset Key: KAFKA-12996 URL: https://issues.apache.org/jira/browse/KAFKA-12996 Project: Kafka Issue Type: Bug Components: core Affects Versions: 2.8.0 Reporter: Rajini Sivaram Assignee: Rajini Sivaram Fix For: 3.0.0, 2.8.1 {color:#24292e}If fetchOffset < startOffset, we currently throw OffsetOutOfRangeException when attempting to read from the log in the regular case. But for diverging epochs, we return Errors.NONE with the new leader start offset, hwm etc.. ReplicaFetcherThread throws OffsetOutOfRangeException when processing responses with Errors.NONE if the leader's offsets in the response are out of range and this moves the partition to failed state. We should add a check for this case when processing fetch requests and ensure OffsetOutOfRangeException is thrown regardless of epochs.{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
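The fix described in KAFKA-12996 is an ordering question: the out-of-range check must win over the diverging-epoch path. A simplified classification of a replica fetch on the leader (a sketch, not the real ReplicaManager code):

```java
/**
 * Sketch of the KAFKA-12996 fix: an out-of-range fetch offset must produce
 * OffsetOutOfRange even when the partition also has a diverging epoch to
 * report, so the follower never sees out-of-range leader offsets under
 * Errors.NONE and moves itself to the failed state.
 */
final class FetchClassifier {
    enum Result { NONE, DIVERGING_EPOCH, OFFSET_OUT_OF_RANGE }

    static Result classify(long fetchOffset, long logStartOffset, long logEndOffset,
                           boolean epochDiverges) {
        // Range check first: it takes precedence over the diverging-epoch path.
        if (fetchOffset < logStartOffset || fetchOffset > logEndOffset) {
            return Result.OFFSET_OUT_OF_RANGE;
        }
        return epochDiverges ? Result.DIVERGING_EPOCH : Result.NONE;
    }
}
```

The buggy behavior corresponds to testing `epochDiverges` first, which returned `NONE` plus leader offsets that the follower then found out of range.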
[jira] [Commented] (KAFKA-12948) NetworkClient.close(node) with node in connecting state makes NetworkClient unusable
[ https://issues.apache.org/jira/browse/KAFKA-12948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363490#comment-17363490 ] Rajini Sivaram commented on KAFKA-12948: Yes, it was due to the changes from [https://cwiki.apache.org/confluence/display/KAFKA/KIP-601%3A+Configurable+socket+connection+timeout+in+NetworkClient] , which went into 2.7. > NetworkClient.close(node) with node in connecting state makes NetworkClient > unusable > > > Key: KAFKA-12948 > URL: https://issues.apache.org/jira/browse/KAFKA-12948 > Project: Kafka > Issue Type: Bug > Components: network >Affects Versions: 2.8.0, 2.7.1 >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Critical > Fix For: 2.7.2, 2.8.1 > > > `NetworkClient.close(node)` closes the node and removes it from > `ClusterConnectionStates.nodeState`, but not from > `ClusterConnectionStates.connectingNodes`. Subsequent `NetworkClient.poll()` > invocations throw IllegalStateException and this leaves the NetworkClient in > an unusable state until the node is removed from connectionNodes or added to > nodeState. We don't use `NetworkClient.close(node)` in clients, but we use it > in clients started by brokers for replica fetcher and controller. Since > brokers use NetworkClientUtils.isReady() before establishing connections and > this invokes poll(), the NetworkClient never recovers. 
> Exception stack trace: > {code:java} > java.lang.IllegalStateException: No entry found for connection 0 > at > org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:409) > at > org.apache.kafka.clients.ClusterConnectionStates.isConnectionSetupTimeout(ClusterConnectionStates.java:446) > at > org.apache.kafka.clients.ClusterConnectionStates.lambda$nodesWithConnectionSetupTimeout$0(ClusterConnectionStates.java:458) > at > java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174) > at > java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1553) > at > java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) > at > java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > at > java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at > java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) > at > org.apache.kafka.clients.ClusterConnectionStates.nodesWithConnectionSetupTimeout(ClusterConnectionStates.java:459) > at > org.apache.kafka.clients.NetworkClient.handleTimedOutConnections(NetworkClient.java:807) > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:564) > at > org.apache.kafka.clients.NetworkClientUtils.isReady(NetworkClientUtils.java:42) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-12948) NetworkClient.close(node) with node in connecting state makes NetworkClient unusable
[ https://issues.apache.org/jira/browse/KAFKA-12948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-12948. Fix Version/s: 3.0.0 Reviewer: David Jacot Resolution: Fixed > NetworkClient.close(node) with node in connecting state makes NetworkClient > unusable > > > Key: KAFKA-12948 > URL: https://issues.apache.org/jira/browse/KAFKA-12948 > Project: Kafka > Issue Type: Bug > Components: network >Affects Versions: 2.8.0, 2.7.1 >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Critical > Fix For: 3.0.0, 2.7.2, 2.8.1 > > > `NetworkClient.close(node)` closes the node and removes it from > `ClusterConnectionStates.nodeState`, but not from > `ClusterConnectionStates.connectingNodes`. Subsequent `NetworkClient.poll()` > invocations throw IllegalStateException and this leaves the NetworkClient in > an unusable state until the node is removed from connectionNodes or added to > nodeState. We don't use `NetworkClient.close(node)` in clients, but we use it > in clients started by brokers for replica fetcher and controller. Since > brokers use NetworkClientUtils.isReady() before establishing connections and > this invokes poll(), the NetworkClient never recovers. 
> Exception stack trace: > {code:java} > java.lang.IllegalStateException: No entry found for connection 0 > at > org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:409) > at > org.apache.kafka.clients.ClusterConnectionStates.isConnectionSetupTimeout(ClusterConnectionStates.java:446) > at > org.apache.kafka.clients.ClusterConnectionStates.lambda$nodesWithConnectionSetupTimeout$0(ClusterConnectionStates.java:458) > at > java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174) > at > java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1553) > at > java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) > at > java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > at > java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at > java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) > at > org.apache.kafka.clients.ClusterConnectionStates.nodesWithConnectionSetupTimeout(ClusterConnectionStates.java:459) > at > org.apache.kafka.clients.NetworkClient.handleTimedOutConnections(NetworkClient.java:807) > at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:564) > at > org.apache.kafka.clients.NetworkClientUtils.isReady(NetworkClientUtils.java:42) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12948) NetworkClient.close(node) with node in connecting state makes NetworkClient unusable
Rajini Sivaram created KAFKA-12948: -- Summary: NetworkClient.close(node) with node in connecting state makes NetworkClient unusable Key: KAFKA-12948 URL: https://issues.apache.org/jira/browse/KAFKA-12948 Project: Kafka Issue Type: Bug Components: network Affects Versions: 2.7.1, 2.8.0 Reporter: Rajini Sivaram Assignee: Rajini Sivaram Fix For: 2.7.2, 2.8.1
`NetworkClient.close(node)` closes the node and removes it from `ClusterConnectionStates.nodeState`, but not from `ClusterConnectionStates.connectingNodes`. Subsequent `NetworkClient.poll()` invocations throw IllegalStateException, which leaves the NetworkClient in an unusable state until the node is removed from `connectingNodes` or added back to `nodeState`. We don't use `NetworkClient.close(node)` in clients, but we use it in the clients started by brokers for replica fetchers and the controller. Since brokers use NetworkClientUtils.isReady() before establishing connections, and this invokes poll(), the NetworkClient never recovers. Exception stack trace:
{code:java}
java.lang.IllegalStateException: No entry found for connection 0
    at org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:409)
    at org.apache.kafka.clients.ClusterConnectionStates.isConnectionSetupTimeout(ClusterConnectionStates.java:446)
    at org.apache.kafka.clients.ClusterConnectionStates.lambda$nodesWithConnectionSetupTimeout$0(ClusterConnectionStates.java:458)
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
    at java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1553)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at org.apache.kafka.clients.ClusterConnectionStates.nodesWithConnectionSetupTimeout(ClusterConnectionStates.java:459)
    at org.apache.kafka.clients.NetworkClient.handleTimedOutConnections(NetworkClient.java:807)
    at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:564)
    at org.apache.kafka.clients.NetworkClientUtils.isReady(NetworkClientUtils.java:42)
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
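The dangling-entry failure mode is small enough to model directly. Below is a minimal, self-contained sketch (the class and collections are illustrative stand-ins, not the actual Kafka `ClusterConnectionStates`): closing a connecting node must remove it from both collections, otherwise the connection-setup-timeout scan in `poll()` hits a node id that has no state entry.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal model of the two collections involved in the bug. Names mirror
// the Kafka fields, but this is an illustrative sketch, not Kafka code.
class ConnectionStatesModel {
    final Map<String, String> nodeState = new HashMap<>();  // nodeId -> state
    final Set<String> connectingNodes = new HashSet<>();    // nodeIds still connecting

    void connecting(String id) {
        nodeState.put(id, "CONNECTING");
        connectingNodes.add(id);
    }

    // Buggy close: removes only the nodeState entry, leaving a dangling
    // id in connectingNodes that later poll() iterations trip over.
    void closeBuggy(String id) {
        nodeState.remove(id);
    }

    // Fixed close: remove the node from both collections.
    void closeFixed(String id) {
        nodeState.remove(id);
        connectingNodes.remove(id);
    }

    // Models the per-node lookup performed while scanning connectingNodes
    // in nodesWithConnectionSetupTimeout() during poll().
    void checkConnectionSetupTimeouts() {
        for (String id : connectingNodes) {
            if (!nodeState.containsKey(id))
                throw new IllegalStateException("No entry found for connection " + id);
        }
    }

    static boolean pollSucceedsAfterClose(boolean fixed) {
        ConnectionStatesModel states = new ConnectionStatesModel();
        states.connecting("0");
        if (fixed) states.closeFixed("0"); else states.closeBuggy("0");
        try {
            states.checkConnectionSetupTimeouts();
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (pollSucceedsAfterClose(false)) throw new AssertionError("buggy close should break poll()");
        if (!pollSucceedsAfterClose(true)) throw new AssertionError("fixed close should keep poll() working");
        System.out.println("ok");
    }
}
```

The `closeFixed` shape is the invariant the report calls for: every id in the connecting set must have a matching state entry.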
[jira] [Resolved] (KAFKA-12867) Trogdor ConsumeBenchWorker quits prematurely with maxMessages config
[ https://issues.apache.org/jira/browse/KAFKA-12867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-12867. Fix Version/s: 3.0.0 Reviewer: Rajini Sivaram Resolution: Fixed > Trogdor ConsumeBenchWorker quits prematurely with maxMessages config > > > Key: KAFKA-12867 > URL: https://issues.apache.org/jira/browse/KAFKA-12867 > Project: Kafka > Issue Type: Bug >Reporter: Kowshik Prakasam >Assignee: Kowshik Prakasam >Priority: Major > Fix For: 3.0.0 > > > The Trogdor > [ConsumeBenchWorker|https://github.com/apache/kafka/commits/fc405d792de12a50956195827eaf57bbf6c9/trogdor/src/main/java/org/apache/kafka/trogdor/workload/ConsumeBenchWorker.java] > has a bug. If one of the consumption tasks completes executing successfully > due to [maxMessages being > consumed|https://github.com/apache/kafka/blob/fc405d792de12a50956195827eaf57bbf6c9/trogdor/src/main/java/org/apache/kafka/trogdor/workload/ConsumeBenchWorker.java#L245], > then the consumption task [notifies the > doneFuture|https://github.com/apache/kafka/blob/fc405d792de12a50956195827eaf57bbf6c9/trogdor/src/main/java/org/apache/kafka/trogdor/workload/ConsumeBenchWorker.java#L285] > causing the ConsumeBenchWorker to halt. This becomes a problem when more > than 1 consumption task is running in parallel, because the successful > completion of 1 of the tasks shuts down the entire worker while the other > tasks are still running. When the worker is shut down, it > [kills|https://github.com/apache/kafka/blob/fc405d792de12a50956195827eaf57bbf6c9/trogdor/src/main/java/org/apache/kafka/trogdor/workload/ConsumeBenchWorker.java#L482] > all the active consumption tasks, which is not the desired behavior. > The fix is to not notify the doneFuture when 1 of the consumption tasks > completes without error. 
Instead, we should defer the notification to the > [CloseStatusUpdater|https://github.com/apache/kafka/blob/fc405d792de12a50956195827eaf57bbf6c9/trogdor/src/main/java/org/apache/kafka/trogdor/workload/ConsumeBenchWorker.java#L299] > thread. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-12867) Trogdor ConsumeBenchWorker quits prematurely with maxMessages config
[ https://issues.apache.org/jira/browse/KAFKA-12867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram reassigned KAFKA-12867: -- Assignee: Kowshik Prakasam > Trogdor ConsumeBenchWorker quits prematurely with maxMessages config > > > Key: KAFKA-12867 > URL: https://issues.apache.org/jira/browse/KAFKA-12867 > Project: Kafka > Issue Type: Bug >Reporter: Kowshik Prakasam >Assignee: Kowshik Prakasam >Priority: Major > > The Trogdor > [ConsumeBenchWorker|https://github.com/apache/kafka/commits/fc405d792de12a50956195827eaf57bbf6c9/trogdor/src/main/java/org/apache/kafka/trogdor/workload/ConsumeBenchWorker.java] > has a bug. If one of the consumption tasks completes executing successfully > due to [maxMessages being > consumed|https://github.com/apache/kafka/blob/fc405d792de12a50956195827eaf57bbf6c9/trogdor/src/main/java/org/apache/kafka/trogdor/workload/ConsumeBenchWorker.java#L245], > then the consumption task [notifies the > doneFuture|https://github.com/apache/kafka/blob/fc405d792de12a50956195827eaf57bbf6c9/trogdor/src/main/java/org/apache/kafka/trogdor/workload/ConsumeBenchWorker.java#L285] > causing the ConsumeBenchWorker to halt. This becomes a problem when more > than 1 consumption task is running in parallel, because the successful > completion of 1 of the tasks shuts down the entire worker while the other > tasks are still running. When the worker is shut down, it > [kills|https://github.com/apache/kafka/blob/fc405d792de12a50956195827eaf57bbf6c9/trogdor/src/main/java/org/apache/kafka/trogdor/workload/ConsumeBenchWorker.java#L482] > all the active consumption tasks, which is not the desired behavior. > The fix is to not notify the doneFuture when 1 of the consumption tasks > completes without error. 
Instead, we should defer the notification to the > [CloseStatusUpdater|https://github.com/apache/kafka/blob/fc405d792de12a50956195827eaf57bbf6c9/trogdor/src/main/java/org/apache/kafka/trogdor/workload/ConsumeBenchWorker.java#L299] > thread. -- This message was sent by Atlassian Jira (v8.3.4#803005)
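The completion-ordering fix lends itself to a small sketch. The names below are hypothetical (not the actual Trogdor classes): a task finishing successfully should only complete the shared doneFuture once no tasks remain, mirroring the deferral to the CloseStatusUpdater thread described above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: how early completion of a shared doneFuture halts a
// worker, and how deferring completion until the last task finishes fixes it.
class WorkerCompletionSketch {
    final CompletableFuture<String> doneFuture = new CompletableFuture<>();
    final AtomicInteger remainingTasks;

    WorkerCompletionSketch(int taskCount) {
        remainingTasks = new AtomicInteger(taskCount);
    }

    // Buggy behavior: the first successful task halts the whole worker.
    void taskFinishedBuggy() {
        doneFuture.complete("");
    }

    // Fixed behavior: complete only when the last task has finished.
    void taskFinishedFixed() {
        if (remainingTasks.decrementAndGet() == 0)
            doneFuture.complete("");
    }

    // With two parallel tasks, is the worker marked done after only one finishes?
    static boolean workerDoneAfterOneOfTwo(boolean fixed) {
        WorkerCompletionSketch worker = new WorkerCompletionSketch(2);
        if (fixed) worker.taskFinishedFixed(); else worker.taskFinishedBuggy();
        return worker.doneFuture.isDone();
    }

    public static void main(String[] args) {
        if (!workerDoneAfterOneOfTwo(false)) throw new AssertionError("buggy path completes early");
        if (workerDoneAfterOneOfTwo(true)) throw new AssertionError("fixed path should still be running");
        System.out.println("ok");
    }
}
```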
[jira] [Resolved] (KAFKA-12751) ISRs remain in in-flight state if proposed state is same as actual state
[ https://issues.apache.org/jira/browse/KAFKA-12751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-12751. Reviewer: David Arthur Resolution: Fixed > ISRs remain in in-flight state if proposed state is same as actual state > > > Key: KAFKA-12751 > URL: https://issues.apache.org/jira/browse/KAFKA-12751 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.7.0, 2.8.0, 2.7.1 >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Critical > Fix For: 2.7.2, 2.8.1 > > > If proposed ISR state in an AlterIsr request is the same as the actual state, > Controller returns a successful response without performing any updates. But > the broker code that processes the response leaves the ISR state in in-flight > state without committing. This prevents further ISR updates until the next > leader election. -- This message was sent by Atlassian Jira (v8.3.4#803005)
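The in-flight bookkeeping problem can be sketched with a toy model (hypothetical fields, not the real partition code): on a successful AlterIsr response the broker must commit and clear the in-flight state even when the proposed ISR equals the committed one, otherwise the partition stays blocked for further ISR updates.

```java
// Toy model of the ISR in-flight state handling described in the issue.
class IsrStateSketch {
    String committedIsr = "1,2,3";
    String inflightIsr = null;

    void submit(String proposed) { inflightIsr = proposed; }

    // Buggy response handling: commit only when the state actually changed,
    // leaving the in-flight entry dangling when proposed == actual.
    void onSuccessBuggy() {
        if (inflightIsr != null && !inflightIsr.equals(committedIsr)) {
            committedIsr = inflightIsr;
            inflightIsr = null;
        }
    }

    // Fixed response handling: always commit and clear the in-flight state.
    void onSuccessFixed() {
        if (inflightIsr != null) {
            committedIsr = inflightIsr;
            inflightIsr = null;
        }
    }

    // Further ISR updates are only possible once the in-flight state is cleared.
    static boolean canUpdateAgainAfterNoopProposal(boolean fixed) {
        IsrStateSketch partition = new IsrStateSketch();
        partition.submit("1,2,3");  // proposed state == actual state
        if (fixed) partition.onSuccessFixed(); else partition.onSuccessBuggy();
        return partition.inflightIsr == null;
    }

    public static void main(String[] args) {
        if (canUpdateAgainAfterNoopProposal(false)) throw new AssertionError();
        if (!canUpdateAgainAfterNoopProposal(true)) throw new AssertionError();
        System.out.println("ok");
    }
}
```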
[jira] [Updated] (KAFKA-12730) A single Kerberos login failure fails all future connections from Java 9 onwards
[ https://issues.apache.org/jira/browse/KAFKA-12730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram updated KAFKA-12730: --- Fix Version/s: 2.8.1 2.7.2 2.6.3 2.5.2 > A single Kerberos login failure fails all future connections from Java 9 > onwards > > > Key: KAFKA-12730 > URL: https://issues.apache.org/jira/browse/KAFKA-12730 > Project: Kafka > Issue Type: Bug > Components: security >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > Fix For: 3.0.0, 2.5.2, 2.6.3, 2.7.2, 2.8.1 > > > The refresh thread for Kerberos performs re-login by logging out and then > logging in again. If login fails, we retry after a backoff. Every iteration > of the loop performs loginContext.logout() and loginContext.login(). If login > fails, we end up with two consecutive logouts. This used to work, but from > Java 9 onwards, this results in a NullPointerException due to > https://bugs.openjdk.java.net/browse/JDK-8173069. We should check if logout > is required before attempting logout. -- This message was sent by Atlassian Jira (v8.3.4#803005)
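The guard described in the last sentence can be sketched as follows. This is a self-contained model with hypothetical names, not the actual Kafka login refresh code; the stand-in logout() throws NullPointerException on a second call, imitating JDK-8173069 on Java 9+.

```java
// Model of one refresh-loop iteration with the "check before logout" guard.
class ReLoginSketch {
    boolean loggedIn = false;
    final boolean loginSucceeds;

    ReLoginSketch(boolean loginSucceeds) { this.loginSucceeds = loginSucceeds; }

    void logout() {
        if (!loggedIn)
            throw new NullPointerException("second logout, as on Java 9+ (JDK-8173069)");
        loggedIn = false;
    }

    void login() throws Exception {
        if (!loginSucceeds) throw new Exception("login failed");
        loggedIn = true;
    }

    // One iteration of the refresh loop with the guard applied.
    void reLoginIteration() {
        if (loggedIn)   // the fix: only log out when actually logged in
            logout();
        try {
            login();
        } catch (Exception e) {
            // the real refresh thread retries after a backoff
        }
    }

    // Without the guard, two consecutive logouts blow up.
    static boolean unguardedSecondLogoutThrows() {
        ReLoginSketch ctx = new ReLoginSketch(false);
        ctx.loggedIn = true;   // start logged in, as the refresh thread does
        ctx.logout();          // first logout succeeds
        try {
            ctx.logout();      // second logout without the guard -> NPE
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    // With the guard, repeated failed logins no longer trigger the NPE.
    static boolean survivesTwoFailedIterations() {
        ReLoginSketch ctx = new ReLoginSketch(false);
        try {
            ctx.reLoginIteration();
            ctx.reLoginIteration();
            return true;
        } catch (NullPointerException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (!unguardedSecondLogoutThrows()) throw new AssertionError();
        if (!survivesTwoFailedIterations()) throw new AssertionError();
        System.out.println("ok");
    }
}
```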
[jira] [Updated] (KAFKA-10727) Kafka clients throw AuthenticationException during Kerberos re-login
[ https://issues.apache.org/jira/browse/KAFKA-10727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram updated KAFKA-10727: --- Fix Version/s: 2.7.2 2.6.3 2.5.2 > Kafka clients throw AuthenticationException during Kerberos re-login > > > Key: KAFKA-10727 > URL: https://issues.apache.org/jira/browse/KAFKA-10727 > Project: Kafka > Issue Type: Bug >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > Fix For: 2.5.2, 2.8.0, 2.6.3, 2.7.2 > > > During Kerberos re-login, we log out and login again. There is a timing issue > where the principal in the Subject has been cleared, but a new one hasn't > been populated yet. We need to ensure that we don't throw > AuthenticationException in this case to avoid Kafka clients > (consumer/producer etc.) failing instead of retrying. -- This message was sent by Atlassian Jira (v8.3.4#803005)
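A minimal sketch of the intended handling (hypothetical names, not the real SASL client internals): an empty principal observed while re-login is in progress should be treated as retriable rather than as an authentication failure, so clients retry instead of failing.

```java
import java.util.Optional;

// Toy classification of the re-login timing window described in the issue.
class KerberosWindowSketch {
    enum Outcome { AUTH_FAILED, RETRIABLE, OK }

    static Outcome classify(Optional<String> principal, boolean reLoginInProgress) {
        if (principal.isPresent()) return Outcome.OK;
        // No principal in the Subject: only a real failure when no
        // re-login is under way; otherwise the client should retry.
        return reLoginInProgress ? Outcome.RETRIABLE : Outcome.AUTH_FAILED;
    }

    public static void main(String[] args) {
        if (classify(Optional.empty(), true) != Outcome.RETRIABLE) throw new AssertionError();
        if (classify(Optional.empty(), false) != Outcome.AUTH_FAILED) throw new AssertionError();
        if (classify(Optional.of("kafka/host@REALM"), false) != Outcome.OK) throw new AssertionError();
        System.out.println("ok");
    }
}
```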
[jira] [Created] (KAFKA-12751) ISRs remain in in-flight state if proposed state is same as actual state
Rajini Sivaram created KAFKA-12751: -- Summary: ISRs remain in in-flight state if proposed state is same as actual state Key: KAFKA-12751 URL: https://issues.apache.org/jira/browse/KAFKA-12751 Project: Kafka Issue Type: Bug Components: core Affects Versions: 2.8.0, 2.7.0, 2.7.1 Reporter: Rajini Sivaram Assignee: Rajini Sivaram Fix For: 2.7.2, 2.8.1 If proposed ISR state in an AlterIsr request is the same as the actual state, Controller returns a successful response without performing any updates. But the broker code that processes the response leaves the ISR state in in-flight state without committing. This prevents further ISR updates until the next leader election. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-12730) A single Kerberos login failure fails all future connections from Java 9 onwards
[ https://issues.apache.org/jira/browse/KAFKA-12730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-12730. Reviewer: Manikumar Resolution: Fixed > A single Kerberos login failure fails all future connections from Java 9 > onwards > > > Key: KAFKA-12730 > URL: https://issues.apache.org/jira/browse/KAFKA-12730 > Project: Kafka > Issue Type: Bug > Components: security >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > Fix For: 3.0.0 > > > The refresh thread for Kerberos performs re-login by logging out and then > logging in again. If login fails, we retry after a backoff. Every iteration > of the loop performs loginContext.logout() and loginContext.login(). If login > fails, we end up with two consecutive logouts. This used to work, but from > Java 9 onwards, this results in a NullPointerException due to > https://bugs.openjdk.java.net/browse/JDK-8173069. We should check if logout > is required before attempting logout. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12730) A single Kerberos login failure fails all future connections from Java 9 onwards
Rajini Sivaram created KAFKA-12730: -- Summary: A single Kerberos login failure fails all future connections from Java 9 onwards Key: KAFKA-12730 URL: https://issues.apache.org/jira/browse/KAFKA-12730 Project: Kafka Issue Type: Bug Components: security Reporter: Rajini Sivaram Assignee: Rajini Sivaram Fix For: 3.0.0 The refresh thread for Kerberos performs re-login by logging out and then logging in again. If login fails, we retry after a backoff. Every iteration of the loop performs loginContext.logout() and loginContext.login(). If login fails, we end up with two consecutive logouts. This used to work, but from Java 9 onwards, this results in a NullPointerException due to https://bugs.openjdk.java.net/browse/JDK-8173069. We should check if logout is required before attempting logout. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12598) Remove deprecated --zookeeper in ConfigCommand
[ https://issues.apache.org/jira/browse/KAFKA-12598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320471#comment-17320471 ] Rajini Sivaram commented on KAFKA-12598: [~cmccabe] [~hachikuji] Have we figured out how/when to remove ZK options for commands used to bootstrap brokers? > Remove deprecated --zookeeper in ConfigCommand > -- > > Key: KAFKA-12598 > URL: https://issues.apache.org/jira/browse/KAFKA-12598 > Project: Kafka > Issue Type: Sub-task >Reporter: Luke Chen >Assignee: Luke Chen >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-12479) Combine partition offset requests into single request in ConsumerGroupCommand
[ https://issues.apache.org/jira/browse/KAFKA-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-12479. Fix Version/s: 3.0.0 Reviewer: David Jacot Resolution: Fixed > Combine partition offset requests into single request in ConsumerGroupCommand > - > > Key: KAFKA-12479 > URL: https://issues.apache.org/jira/browse/KAFKA-12479 > Project: Kafka > Issue Type: Improvement > Components: tools >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > Fix For: 3.0.0 > > > We currently send one request per partition to obtain offset information. For > a group with a large number of partitions, this can take several minutes. It > would be more efficient to send a single request containing all partitions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
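The round-trip arithmetic behind this improvement can be illustrated with a toy model (hypothetical request shapes, not the AdminClient API): fetching offsets one partition at a time costs one round trip per partition, while a single batched request carries every partition at once.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy comparison of per-partition vs. batched offset lookups.
class OffsetRequestBatchingSketch {
    static int requestsPerPartition(List<String> partitions) {
        int requests = 0;
        Map<String, Long> offsets = new HashMap<>();
        for (String tp : partitions) {  // one round trip per partition
            requests++;
            offsets.put(tp, 0L);
        }
        return requests;
    }

    static int requestsBatched(List<String> partitions) {
        // a single round trip carrying all partitions
        return partitions.isEmpty() ? 0 : 1;
    }

    public static void main(String[] args) {
        List<String> partitions = List.of("t-0", "t-1", "t-2", "t-3");
        if (requestsPerPartition(partitions) != 4) throw new AssertionError();
        if (requestsBatched(partitions) != 1) throw new AssertionError();
        System.out.println("ok");
    }
}
```

For a group with thousands of partitions, the difference between N round trips and one is what turns "several minutes" into a single request's latency.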
[jira] [Resolved] (KAFKA-12268) System tests broken because consumer returns early without records
[ https://issues.apache.org/jira/browse/KAFKA-12268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-12268. Resolution: Fixed > System tests broken because consumer returns early without records > --- > > Key: KAFKA-12268 > URL: https://issues.apache.org/jira/browse/KAFKA-12268 > Project: Kafka > Issue Type: Bug > Components: consumer >Reporter: Rajini Sivaram >Assignee: John Roesler >Priority: Critical > Fix For: 2.8.0 > > > https://issues.apache.org/jira/browse/KAFKA-10866 added metadata to > ConsumerRecords. We add metadata even when there are no records. As a result, > we sometimes return early from KafkaConsumer#poll() with no records because > FetchedRecords.isEmpty returns false if either metadata or records are > available. This breaks system tests which rely on poll timeout, expecting > records to be returned on every poll when there is no timeout. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-12268) System tests broken because consumer returns early without records
[ https://issues.apache.org/jira/browse/KAFKA-12268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram reassigned KAFKA-12268: -- Assignee: John Roesler (was: Rajini Sivaram) > System tests broken because consumer returns early without records > --- > > Key: KAFKA-12268 > URL: https://issues.apache.org/jira/browse/KAFKA-12268 > Project: Kafka > Issue Type: Bug > Components: consumer >Reporter: Rajini Sivaram >Assignee: John Roesler >Priority: Critical > Fix For: 2.8.0 > > > https://issues.apache.org/jira/browse/KAFKA-10866 added metadata to > ConsumerRecords. We add metadata even when there are no records. As a result, > we sometimes return early from KafkaConsumer#poll() with no records because > FetchedRecords.isEmpty returns false if either metadata or records are > available. This breaks system tests which rely on poll timeout, expecting > records to be returned on every poll when there is no timeout. -- This message was sent by Atlassian Jira (v8.3.4#803005)
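The emptiness check at the heart of this bug can be sketched with a toy model (hypothetical class, not the real consumer internals): once fetch results carry metadata as well as records, the early-return decision in poll() must look at records only.

```java
import java.util.List;
import java.util.Map;

// Toy model of a fetch result that carries metadata alongside records.
class FetchedRecordsSketch {
    final Map<String, List<String>> records;  // partition -> records
    final Map<String, Long> metadata;         // partition -> e.g. position updates

    FetchedRecordsSketch(Map<String, List<String>> records, Map<String, Long> metadata) {
        this.records = records;
        this.metadata = metadata;
    }

    // Buggy check: "non-empty" if either records or metadata are present.
    boolean isEmptyBuggy() {
        return records.isEmpty() && metadata.isEmpty();
    }

    // Fixed decision for returning early from poll(): records only.
    boolean hasRecordsToReturn() {
        return !records.isEmpty();
    }

    // Does poll() hand back a batch before its timeout for a metadata-only fetch?
    static boolean pollReturnsEarlyWithMetadataOnly(boolean fixed) {
        FetchedRecordsSketch fetch =
            new FetchedRecordsSketch(Map.of(), Map.of("topic-0", 42L));
        return fixed ? fetch.hasRecordsToReturn() : !fetch.isEmptyBuggy();
    }

    public static void main(String[] args) {
        // Buggy path: metadata alone made poll() return with zero records.
        if (!pollReturnsEarlyWithMetadataOnly(false)) throw new AssertionError();
        // Fixed path: poll() keeps waiting for records until its timeout.
        if (pollReturnsEarlyWithMetadataOnly(true)) throw new AssertionError();
        System.out.println("ok");
    }
}
```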
[jira] [Resolved] (KAFKA-12330) FetchSessionCache may cause starvation for partitions when FetchResponse is full
[ https://issues.apache.org/jira/browse/KAFKA-12330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-12330. Fix Version/s: 3.0.0 Resolution: Fixed > FetchSessionCache may cause starvation for partitions when FetchResponse is > full > > > Key: KAFKA-12330 > URL: https://issues.apache.org/jira/browse/KAFKA-12330 > Project: Kafka > Issue Type: Bug >Reporter: Lucas Bradstreet >Assignee: David Jacot >Priority: Major > Fix For: 3.0.0 > > > The incremental FetchSessionCache sessions deprioritizes partitions where a > response is returned. This may happen if log metadata such as log start > offset, hwm, etc is returned, or if data for that partition is returned. > When a fetch response fills to maxBytes, data may not be returned for > partitions even if the fetch offset is lower than the fetch upper bound. > However, the fetch response will still contain updates to metadata such as > hwm if that metadata has changed. This can lead to degenerate behavior where > a partition's hwm or log start offset is updated resulting in the next fetch > being unnecessarily skipped for that partition. At first this appeared to be > worse, as hwm updates occur frequently, but starvation should result in hwm > movement becoming blocked, allowing a fetch to go through and then becoming > unstuck. However, it'll still require one more fetch request than necessary > to do so. Consumers may be affected more than replica fetchers, however they > often remove partitions with fetched data from the next fetch request and > this may be helping prevent starvation. > I believe we should only reorder the partition fetch priority if data is > actually returned for a partition. 
> {noformat}
> private class PartitionIterator(val iter: FetchSession.RESP_MAP_ITER,
>                                 val updateFetchContextAndRemoveUnselected: Boolean)
>   extends FetchSession.RESP_MAP_ITER {
>   var nextElement: util.Map.Entry[TopicPartition, FetchResponse.PartitionData[Records]] = null
>   override def hasNext: Boolean = {
>     while ((nextElement == null) && iter.hasNext) {
>       val element = iter.next()
>       val topicPart = element.getKey
>       val respData = element.getValue
>       val cachedPart = session.partitionMap.find(new CachedPartition(topicPart))
>       val mustRespond = cachedPart.maybeUpdateResponseData(respData, updateFetchContextAndRemoveUnselected)
>       if (mustRespond) {
>         nextElement = element
>         // Example POC change:
>         // Don't move partition to end of queue if we didn't actually fetch data.
>         // This should help avoid starvation even when we are filling the fetch
>         // response fully while returning metadata for these partitions.
>         if (updateFetchContextAndRemoveUnselected && respData.records != null && respData.records.sizeInBytes > 0) {
>           session.partitionMap.remove(cachedPart)
>           session.partitionMap.mustAdd(cachedPart)
>         }
>       } else {
>         if (updateFetchContextAndRemoveUnselected) {
>           iter.remove()
>         }
>       }
>     }
>     nextElement != null
>   }
> {noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-12427) Broker does not close muted idle connections with buffered data
[ https://issues.apache.org/jira/browse/KAFKA-12427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-12427. Fix Version/s: 3.0.0 Reviewer: Rajini Sivaram Resolution: Fixed > Broker does not close muted idle connections with buffered data > --- > > Key: KAFKA-12427 > URL: https://issues.apache.org/jira/browse/KAFKA-12427 > Project: Kafka > Issue Type: Bug > Components: core, network >Reporter: David Mao >Assignee: David Mao >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12479) Combine partition offset requests into single request in ConsumerGroupCommand
Rajini Sivaram created KAFKA-12479: -- Summary: Combine partition offset requests into single request in ConsumerGroupCommand Key: KAFKA-12479 URL: https://issues.apache.org/jira/browse/KAFKA-12479 Project: Kafka Issue Type: Improvement Components: tools Reporter: Rajini Sivaram Assignee: Rajini Sivaram We currently send one request per partition to obtain offset information. For a group with a large number of partitions, this can take several minutes. It would be more efficient to send a single request containing all partitions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12431) Fetch Request/Response without Topic information
[ https://issues.apache.org/jira/browse/KAFKA-12431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297989#comment-17297989 ] Rajini Sivaram commented on KAFKA-12431: I can't think of any changes in 2.5.x or 2.6.x that explain this behaviour. Looking at the code, I think we don't handle arithmetic overflows. So a zero quota may result in no quota enforcement because of overflows that set throttleTimeMs to a negative value. But this code doesn't seem to have changed much since the older versions. Empty fetch response for quota violation was introduced in 2.0.0 under KIP-219. In 2.6.0, we improved on this by sending partial responses under KAFKA-9677. That may not help with very low quotas, but shouldn't have caused any degradation compared to before. In any case, this shouldn't have altered the handling of quota=0. > Fetch Request/Response without Topic information > > > Key: KAFKA-12431 > URL: https://issues.apache.org/jira/browse/KAFKA-12431 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.6.1 >Reporter: Peter Sinoros-Szabo >Priority: Major > Attachments: fetch-on-2.4.1.png, fetch-on-2.6.1.png, > kafka-highcpu-24.svg.zip, kafka-highcpu-26.svg.zip > > > I was running a 6-node Kafka 2.4.1 cluster with protocol and message format > version set to 2.4. I wanted to upgrade the cluster to 2.6.1 and after I > upgraded the 1st broker to 2.6.1 without any configuration change, I noticed > much higher CPU usage on that broker (instead of 25% CPU usage it was ~350%) > and about 3-4x higher network traffic. So I dumped the traffic between the > Kafka client and broker and compared it with the traffic of the same broker > downgraded to 2.4.1. > It seems to me that after I upgraded to 2.6.1, the Fetch requests and > responses are not complete; they are missing the topics part of the Fetch > Request, I don't know for what reason. I guess there should always be a > topics part. > I'll attach a screenshot of these messages. 
> -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-12254) MirrorMaker 2.0 creates destination topic with default configs
[ https://issues.apache.org/jira/browse/KAFKA-12254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram updated KAFKA-12254: --- Fix Version/s: (was: 3.0.0) > MirrorMaker 2.0 creates destination topic with default configs > -- > > Key: KAFKA-12254 > URL: https://issues.apache.org/jira/browse/KAFKA-12254 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.4.0, 2.5.0, 2.4.1, 2.6.0, 2.5.1, 2.7.0, 2.6.1, 2.8.0 >Reporter: Dhruvil Shah >Assignee: Dhruvil Shah >Priority: Blocker > Fix For: 2.8.0 > > > `MirrorSourceConnector` implements the logic for replicating data, > configurations, and other metadata between the source and destination > clusters. This includes the tasks below: > # `refreshTopicPartitions` for syncing topics / partitions from source to > destination. > # `syncTopicConfigs` for syncing topic configurations from source to > destination. > A limitation is that `computeAndCreateTopicPartitions` creates topics with > default configurations on the destination cluster. A separate async task > `syncTopicConfigs` is responsible for syncing the topic configs. Before that > sync happens, topic configurations could be out of sync between the two > clusters. > In the worst case, this could lead to data loss, e.g. when we have a compacted > topic being mirrored between clusters which is incorrectly created with the > default configuration of `cleanup.policy = delete` on the destination before > the configurations are sync'd via `syncTopicConfigs`. > Here is an example of the divergence:
> Source Topic:
> ```
> Topic: foobar PartitionCount: 1 ReplicationFactor: 1 Configs: cleanup.policy=compact,segment.bytes=1073741824
> ```
> Destination Topic:
> ```
> Topic: A.foobar PartitionCount: 1 ReplicationFactor: 1 Configs: segment.bytes=1073741824
> ```
> A safer approach is to ensure that the right configurations are set on the destination cluster before data is replicated to it. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
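The ordering hazard is easy to model (toy types, not MirrorMaker code): creating the destination topic with broker defaults and syncing configs asynchronously leaves a window in which a compacted source topic is backed by a delete-policy destination topic.

```java
import java.util.Map;

// Toy illustration of why config ordering matters when mirroring topics.
class TopicCreationOrderSketch {
    // Create-then-sync: the topic starts with broker defaults until the
    // async config sync runs.
    static Map<String, String> createWithDefaults() {
        return Map.of("cleanup.policy", "delete");
    }

    // Safer order: apply the source topic's configs at creation time,
    // before any data is replicated.
    static Map<String, String> createWithSourceConfigs(Map<String, String> sourceConfigs) {
        return sourceConfigs;
    }

    public static void main(String[] args) {
        Map<String, String> source = Map.of("cleanup.policy", "compact");
        // Window of divergence with create-then-sync:
        if (createWithDefaults().get("cleanup.policy").equals(source.get("cleanup.policy")))
            throw new AssertionError("expected divergence before the async sync runs");
        // No window when configs are applied at creation:
        if (!createWithSourceConfigs(source).get("cleanup.policy").equals("compact"))
            throw new AssertionError();
        System.out.println("ok");
    }
}
```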
[jira] [Resolved] (KAFKA-10700) Support mutual TLS authentication for SASL_SSL listeners
[ https://issues.apache.org/jira/browse/KAFKA-10700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-10700. Reviewer: David Jacot Resolution: Fixed > Support mutual TLS authentication for SASL_SSL listeners > > > Key: KAFKA-10700 > URL: https://issues.apache.org/jira/browse/KAFKA-10700 > Project: Kafka > Issue Type: New Feature > Components: security >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > Fix For: 2.8.0 > > > See > https://cwiki.apache.org/confluence/display/KAFKA/KIP-684+-+Support+mutual+TLS+authentication+on+SASL_SSL+listeners > for details -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12268) System tests broken because consumer returns early without records
Rajini Sivaram created KAFKA-12268: -- Summary: System tests broken because consumer returns early without records Key: KAFKA-12268 URL: https://issues.apache.org/jira/browse/KAFKA-12268 Project: Kafka Issue Type: Bug Components: consumer Reporter: Rajini Sivaram Assignee: Rajini Sivaram Fix For: 2.8.0 https://issues.apache.org/jira/browse/KAFKA-10866 added metadata to ConsumerRecords. We add metadata even when there are no records. As a result, we sometimes return early from KafkaConsumer#poll() with no records because FetchedRecords.isEmpty returns false if either metadata or records are available. This breaks system tests which rely on poll timeout, expecting records to be returned on every poll when there is no timeout. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-10798) Failed authentication delay doesn't work with some SASL authentication failures
[ https://issues.apache.org/jira/browse/KAFKA-10798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram updated KAFKA-10798: --- Fix Version/s: 2.6.2 2.7.1 > Failed authentication delay doesn't work with some SASL authentication > failures > --- > > Key: KAFKA-10798 > URL: https://issues.apache.org/jira/browse/KAFKA-10798 > Project: Kafka > Issue Type: Bug > Components: security >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > Fix For: 2.8.0, 2.7.1, 2.6.2 > > > KIP-306 introduced the config `connection.failed.authentication.delay.ms` to > delay connection closing on brokers for failed authentication to limit the > rate of retried authentications from clients in order to avoid excessive > authentication load on brokers from failed clients. We rely on authentication > failure response to be delayed in this case to prevent clients from detecting > the failure and retrying sooner. > SaslServerAuthenticator delays response for SaslAuthenticationException, but > not for SaslException, even though SaslException is also converted into > SaslAuthenticationException and processed as an authentication failure by > both server and clients. As a result, connection delay is not applied in many > scenarios like SCRAM authentication failures. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-10798) Failed authentication delay doesn't work with some SASL authentication failures
[ https://issues.apache.org/jira/browse/KAFKA-10798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-10798. Reviewer: Manikumar Resolution: Fixed > Failed authentication delay doesn't work with some SASL authentication > failures > --- > > Key: KAFKA-10798 > URL: https://issues.apache.org/jira/browse/KAFKA-10798 > Project: Kafka > Issue Type: Bug > Components: security >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > Fix For: 2.8.0 > > > KIP-306 introduced the config `connection.failed.authentication.delay.ms` to > delay connection closing on brokers for failed authentication to limit the > rate of retried authentications from clients in order to avoid excessive > authentication load on brokers from failed clients. We rely on authentication > failure response to be delayed in this case to prevent clients from detecting > the failure and retrying sooner. > SaslServerAuthenticator delays response for SaslAuthenticationException, but > not for SaslException, even though SaslException is also converted into > SaslAuthenticationException and processed as an authentication failure by > both server and clients. As a result, connection delay is not applied in many > scenarios like SCRAM authentication failures. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-10554) Perform follower truncation based on epoch offsets returned in Fetch response
[ https://issues.apache.org/jira/browse/KAFKA-10554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-10554. Reviewer: Jason Gustafson Resolution: Fixed > Perform follower truncation based on epoch offsets returned in Fetch response > - > > Key: KAFKA-10554 > URL: https://issues.apache.org/jira/browse/KAFKA-10554 > Project: Kafka > Issue Type: Task > Components: replication >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > Fix For: 2.8.0 > > > KAFKA-10435 updated fetch protocol for KIP-595 to return diverging epoch and > offset as part of fetch response. We can use this to truncate logs in > followers while processing fetch responses. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-10798) Failed authentication delay doesn't work with some SASL authentication failures
Rajini Sivaram created KAFKA-10798: -- Summary: Failed authentication delay doesn't work with some SASL authentication failures Key: KAFKA-10798 URL: https://issues.apache.org/jira/browse/KAFKA-10798 Project: Kafka Issue Type: Bug Components: security Reporter: Rajini Sivaram Assignee: Rajini Sivaram Fix For: 2.8.0 KIP-306 introduced the config `connection.failed.authentication.delay.ms` to delay connection closing on brokers for failed authentication to limit the rate of retried authentications from clients in order to avoid excessive authentication load on brokers from failed clients. We rely on authentication failure response to be delayed in this case to prevent clients from detecting the failure and retrying sooner. SaslServerAuthenticator delays response for SaslAuthenticationException, but not for SaslException, even though SaslException is also converted into SaslAuthenticationException and processed as an authentication failure by both server and clients. As a result, connection delay is not applied in many scenarios like SCRAM authentication failures. -- This message was sent by Atlassian Jira (v8.3.4#803005)
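The classification gap can be sketched as follows. The helper below is hypothetical (not the actual SaslServerAuthenticator), and a local stand-in class is used in place of Kafka's SaslAuthenticationException: the fix is to apply the configured delay to SaslException as well, since it is converted into an authentication failure anyway.

```java
import javax.security.sasl.SaslException;

// Toy decision for whether to delay closing a connection on auth failure.
class FailedAuthDelaySketch {
    // Stand-in for Kafka's SaslAuthenticationException.
    static class SaslAuthenticationException extends RuntimeException {}

    // Buggy: delay applied only for the Kafka-level authentication exception.
    static boolean shouldDelayBuggy(Exception e) {
        return e instanceof SaslAuthenticationException;
    }

    // Fixed: SaslException (e.g. SCRAM failures) is also delayed.
    static boolean shouldDelayFixed(Exception e) {
        return e instanceof SaslAuthenticationException || e instanceof SaslException;
    }

    public static void main(String[] args) {
        Exception scramFailure = new SaslException("invalid SCRAM credentials");
        // Buggy: SCRAM failures surfaced as SaslException were not delayed.
        if (shouldDelayBuggy(scramFailure)) throw new AssertionError();
        // Fixed: connection.failed.authentication.delay.ms now applies.
        if (!shouldDelayFixed(scramFailure)) throw new AssertionError();
        System.out.println("ok");
    }
}
```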
[jira] [Resolved] (KAFKA-10727) Kafka clients throw AuthenticationException during Kerberos re-login
[ https://issues.apache.org/jira/browse/KAFKA-10727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-10727. Fix Version/s: 2.8.0 Reviewer: Manikumar Resolution: Fixed > Kafka clients throw AuthenticationException during Kerberos re-login > > > Key: KAFKA-10727 > URL: https://issues.apache.org/jira/browse/KAFKA-10727 > Project: Kafka > Issue Type: Bug >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > Fix For: 2.8.0 > > > During Kerberos re-login, we log out and login again. There is a timing issue > where the principal in the Subject has been cleared, but a new one hasn't > been populated yet. We need to ensure that we don't throw > AuthenticationException in this case to avoid Kafka clients > (consumer/producer etc.) failing instead of retrying. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-10727) Kafka clients throw AuthenticationException during Kerberos re-login
Rajini Sivaram created KAFKA-10727: -- Summary: Kafka clients throw AuthenticationException during Kerberos re-login Key: KAFKA-10727 URL: https://issues.apache.org/jira/browse/KAFKA-10727 Project: Kafka Issue Type: Bug Reporter: Rajini Sivaram Assignee: Rajini Sivaram During Kerberos re-login, we log out and login again. There is a timing issue where the principal in the Subject has been cleared, but a new one hasn't been populated yet. We need to ensure that we don't throw AuthenticationException in this case to avoid Kafka clients (consumer/producer etc.) failing instead of retrying. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-10700) Support mutual TLS authentication for SASL_SSL listeners
Rajini Sivaram created KAFKA-10700: -- Summary: Support mutual TLS authentication for SASL_SSL listeners Key: KAFKA-10700 URL: https://issues.apache.org/jira/browse/KAFKA-10700 Project: Kafka Issue Type: New Feature Components: security Reporter: Rajini Sivaram Assignee: Rajini Sivaram Fix For: 2.8.0 See https://cwiki.apache.org/confluence/display/KAFKA/KIP-684+-+Support+mutual+TLS+authentication+on+SASL_SSL+listeners for details -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-7987) a broker's ZK session may die on transient auth failure
[ https://issues.apache.org/jira/browse/KAFKA-7987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-7987. --- Fix Version/s: 2.8.0 Reviewer: Jun Rao Resolution: Fixed > a broker's ZK session may die on transient auth failure > --- > > Key: KAFKA-7987 > URL: https://issues.apache.org/jira/browse/KAFKA-7987 > Project: Kafka > Issue Type: Bug >Reporter: Jun Rao >Priority: Critical > Fix For: 2.8.0 > > > After a transient network issue, we saw the following log in a broker. > {code:java} > [23:37:02,102] ERROR SASL authentication with Zookeeper Quorum member failed: > javax.security.sasl.SaslException: An error: > (java.security.PrivilegedActionException: javax.security.sasl.SaslException: > GSS initiate failed [Caused by GSSException: No valid credentials provided > (Mechanism level: Server not found in Kerberos database (7))]) occurred when > evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client > will go to AUTH_FAILED state. (org.apache.zookeeper.ClientCnxn) > [23:37:02,102] ERROR [ZooKeeperClient] Auth failed. > (kafka.zookeeper.ZooKeeperClient) > {code} > The network issue prevented the broker from communicating to ZK. The broker's > ZK session then expired, but the broker didn't know that yet since it > couldn't establish a connection to ZK. When the network was back, the broker > tried to establish a connection to ZK, but failed due to auth failure (likely > due to a transient KDC issue). The current logic just ignores the auth > failure without trying to create a new ZK session. Then the broker will be > permanently in a state that it's alive, but not registered in ZK. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-7987) a broker's ZK session may die on transient auth failure
[ https://issues.apache.org/jira/browse/KAFKA-7987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221756#comment-17221756 ] Rajini Sivaram commented on KAFKA-7987: --- [~junrao]Thank you! I will follow up on the PR. > a broker's ZK session may die on transient auth failure > --- > > Key: KAFKA-7987 > URL: https://issues.apache.org/jira/browse/KAFKA-7987 > Project: Kafka > Issue Type: Bug >Reporter: Jun Rao >Priority: Critical > > After a transient network issue, we saw the following log in a broker. > {code:java} > [23:37:02,102] ERROR SASL authentication with Zookeeper Quorum member failed: > javax.security.sasl.SaslException: An error: > (java.security.PrivilegedActionException: javax.security.sasl.SaslException: > GSS initiate failed [Caused by GSSException: No valid credentials provided > (Mechanism level: Server not found in Kerberos database (7))]) occurred when > evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client > will go to AUTH_FAILED state. (org.apache.zookeeper.ClientCnxn) > [23:37:02,102] ERROR [ZooKeeperClient] Auth failed. > (kafka.zookeeper.ZooKeeperClient) > {code} > The network issue prevented the broker from communicating to ZK. The broker's > ZK session then expired, but the broker didn't know that yet since it > couldn't establish a connection to ZK. When the network was back, the broker > tried to establish a connection to ZK, but failed due to auth failure (likely > due to a transient KDC issue). The current logic just ignores the auth > failure without trying to create a new ZK session. Then the broker will be > permanently in a state that it's alive, but not registered in ZK. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-7987) a broker's ZK session may die on transient auth failure
[ https://issues.apache.org/jira/browse/KAFKA-7987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram updated KAFKA-7987: -- Issue Type: Bug (was: Improvement) Priority: Critical (was: Major) > a broker's ZK session may die on transient auth failure > --- > > Key: KAFKA-7987 > URL: https://issues.apache.org/jira/browse/KAFKA-7987 > Project: Kafka > Issue Type: Bug >Reporter: Jun Rao >Priority: Critical > > After a transient network issue, we saw the following log in a broker. > {code:java} > [23:37:02,102] ERROR SASL authentication with Zookeeper Quorum member failed: > javax.security.sasl.SaslException: An error: > (java.security.PrivilegedActionException: javax.security.sasl.SaslException: > GSS initiate failed [Caused by GSSException: No valid credentials provided > (Mechanism level: Server not found in Kerberos database (7))]) occurred when > evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client > will go to AUTH_FAILED state. (org.apache.zookeeper.ClientCnxn) > [23:37:02,102] ERROR [ZooKeeperClient] Auth failed. > (kafka.zookeeper.ZooKeeperClient) > {code} > The network issue prevented the broker from communicating to ZK. The broker's > ZK session then expired, but the broker didn't know that yet since it > couldn't establish a connection to ZK. When the network was back, the broker > tried to establish a connection to ZK, but failed due to auth failure (likely > due to a transient KDC issue). The current logic just ignores the auth > failure without trying to create a new ZK session. Then the broker will be > permanently in a state that it's alive, but not registered in ZK. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-7987) a broker's ZK session may die on transient auth failure
[ https://issues.apache.org/jira/browse/KAFKA-7987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221328#comment-17221328 ] Rajini Sivaram commented on KAFKA-7987: --- [~junrao] This is still an open issue for all versions of Kafka, right? I am looking into an authorizer issue where authorizer notifications were not processed for a long time. Heap dump shows that the authorizer's ZookeeperClient is in AUTH_FAILED state. ZK is Kerberos-enabled and there are a couple of authentication failures in the logs due to clock-skew errors, which look like the reason why the authorizer's ZooKeeperClient got into this state. For the authorizer, we do need to schedule retries in this case. But the issue doesn't seem to have affected other operations of the broker in this case, presumably because we retry connections for the other ZooKeeperClient when there are requests. We should still apply the retry fix to the common ZooKeeperClient, right? Since the affected broker didn't pick up ACL updates made on other brokers, it is a critical security issue. But I wanted to check if we have applied any fixes in this area in newer versions of Kafka. Thanks. > a broker's ZK session may die on transient auth failure > --- > > Key: KAFKA-7987 > URL: https://issues.apache.org/jira/browse/KAFKA-7987 > Project: Kafka > Issue Type: Improvement >Reporter: Jun Rao >Priority: Major > > After a transient network issue, we saw the following log in a broker. > {code:java} > [23:37:02,102] ERROR SASL authentication with Zookeeper Quorum member failed: > javax.security.sasl.SaslException: An error: > (java.security.PrivilegedActionException: javax.security.sasl.SaslException: > GSS initiate failed [Caused by GSSException: No valid credentials provided > (Mechanism level: Server not found in Kerberos database (7))]) occurred when > evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client > will go to AUTH_FAILED state. 
(org.apache.zookeeper.ClientCnxn) > [23:37:02,102] ERROR [ZooKeeperClient] Auth failed. > (kafka.zookeeper.ZooKeeperClient) > {code} > The network issue prevented the broker from communicating to ZK. The broker's > ZK session then expired, but the broker didn't know that yet since it > couldn't establish a connection to ZK. When the network was back, the broker > tried to establish a connection to ZK, but failed due to auth failure (likely > due to a transient KDC issue). The current logic just ignores the auth > failure without trying to create a new ZK session. Then the broker will be > permanently in a state that it's alive, but not registered in ZK. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
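The fix direction discussed in this issue — retry session creation instead of silently ignoring AUTH_FAILED — can be sketched as below. The interface and method names are hypothetical, not the actual kafka.zookeeper.ZooKeeperClient API.

```java
// Illustrative sketch of the proposed fix: when the ZooKeeper client hits
// AUTH_FAILED, do not give up silently; retry tearing down the failed
// session and creating a new one, so a transient KDC or clock-skew error
// does not leave the broker alive but unregistered in ZK.
public class ZkAuthFailureRetry {
    // Models one attempt to (re)create a ZK session; returns true on success.
    public interface SessionFactory {
        boolean tryCreateSession();
    }

    // Returns the number of attempts it took to connect, or -1 if still
    // failing after maxAttempts. Real code would back off between attempts
    // and keep retrying indefinitely rather than cap the attempt count.
    public static int retryUntilConnected(SessionFactory factory, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            if (factory.tryCreateSession())
                return attempts;
            // a real implementation would sleep with backoff here
        }
        return -1;
    }
}
```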
[jira] [Created] (KAFKA-10626) Add support for max.in.flight.requests.per.connection behaviour in MockClient
Rajini Sivaram created KAFKA-10626: -- Summary: Add support for max.in.flight.requests.per.connection behaviour in MockClient Key: KAFKA-10626 URL: https://issues.apache.org/jira/browse/KAFKA-10626 Project: Kafka Issue Type: Task Components: unit tests Reporter: Rajini Sivaram We currently don't have an easy way to test max.in.flight.requests.per.connection behaviour in unit tests. It will be useful to add this for tests like the one in KAFKA-10520 with max.in.flight=1. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-10233) KafkaConsumer polls in a tight loop if group is not authorized
[ https://issues.apache.org/jira/browse/KAFKA-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram updated KAFKA-10233: --- Fix Version/s: (was: 2.7.0) 2.8.0 > KafkaConsumer polls in a tight loop if group is not authorized > -- > > Key: KAFKA-10233 > URL: https://issues.apache.org/jira/browse/KAFKA-10233 > Project: Kafka > Issue Type: Bug > Components: consumer >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > Fix For: 2.8.0 > > > Consumer propagates GroupAuthorizationException from poll immediately when > trying to find coordinator even though it is a retriable exception. If the > application polls in a loop, ignoring retriable exceptions, the consumer > tries to find coordinator in a tight loop without any backoff. We should > apply retry backoff in this case to avoid overloading brokers. -- This message was sent by Atlassian Jira (v8.3.4#803005)
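The backoff the issue calls for can be sketched as a small gate, assuming hypothetical names (this is not the consumer's actual coordinator code): record when a retriable coordinator-lookup failure happened, and refuse to re-send the lookup until retry.backoff.ms has elapsed.

```java
// Illustrative sketch: when coordinator lookup fails with a retriable
// error such as GroupAuthorizationException, remember the failure time
// and gate the next FindCoordinator request on retry.backoff.ms, so an
// application polling in a loop cannot hammer the broker.
public class CoordinatorLookupBackoff {
    private final long retryBackoffMs;
    private long lastFailureMs = -1; // no failure recorded yet

    public CoordinatorLookupBackoff(long retryBackoffMs) {
        this.retryBackoffMs = retryBackoffMs;
    }

    public void onRetriableFailure(long nowMs) {
        lastFailureMs = nowMs;
    }

    // poll() consults this before issuing another FindCoordinator request.
    public boolean canRetry(long nowMs) {
        return lastFailureMs < 0 || nowMs - lastFailureMs >= retryBackoffMs;
    }
}
```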
[jira] [Commented] (KAFKA-10520) InitProducerId may be blocked if least loaded node is not ready to send
[ https://issues.apache.org/jira/browse/KAFKA-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209737#comment-17209737 ] Rajini Sivaram commented on KAFKA-10520: [~ableegoldman] Yes, will try and get this done in time for 2.7.0 code freeze. > InitProducerId may be blocked if least loaded node is not ready to send > --- > > Key: KAFKA-10520 > URL: https://issues.apache.org/jira/browse/KAFKA-10520 > Project: Kafka > Issue Type: Bug > Components: producer >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > Fix For: 2.7.0 > > > From the logs of a failing producer that shows InitProducerId timing out > after request timeout, it looks like we don't poll while waiting for > transactional producer to be initialized and FindCoordinator request cannot > be sent. The producer configuration used one bootstrap server and > `max.in.flight.requests.per.connection=1`. The failing sequence: > # Producer sends MetadataRequest to least loaded node (bootstrap server) > # Producer is ready to send InitProducerId, needs to find transaction > coordinator > # Producer creates FindCoordinator request, but the only node known is the > bootstrap server. Producer cannot send to this node since there is already > the Metadata request in flight and max.inflight is 1. > # Producer waits without polling, so Metadata response is not processed. > InitProducerId times out eventually. > > > We need to update the condition used to determine whether Sender should > poll() to fix this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-10520) InitProducerId may be blocked if least loaded node is not ready to send
[ https://issues.apache.org/jira/browse/KAFKA-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram reassigned KAFKA-10520: -- Assignee: Rajini Sivaram > InitProducerId may be blocked if least loaded node is not ready to send > --- > > Key: KAFKA-10520 > URL: https://issues.apache.org/jira/browse/KAFKA-10520 > Project: Kafka > Issue Type: Bug > Components: producer >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > Fix For: 2.7.0 > > > From the logs of a failing producer that shows InitProducerId timing out > after request timeout, it looks like we don't poll while waiting for > transactional producer to be initialized and FindCoordinator request cannot > be sent. The producer configuration used one bootstrap server and > `max.in.flight.requests.per.connection=1`. The failing sequence: > # Producer sends MetadataRequest to least loaded node (bootstrap server) > # Producer is ready to send InitProducerId, needs to find transaction > coordinator > # Producer creates FindCoordinator request, but the only node known is the > bootstrap server. Producer cannot send to this node since there is already > the Metadata request in flight and max.inflight is 1. > # Producer waits without polling, so Metadata response is not processed. > InitProducerId times out eventually. > > > We need to update the condition used to determine whether Sender should > poll() to fix this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-9497) Brokers start up even if SASL provider is not loaded and throw NPE when clients connect
[ https://issues.apache.org/jira/browse/KAFKA-9497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-9497. --- Resolution: Duplicate > Brokers start up even if SASL provider is not loaded and throw NPE when > clients connect > --- > > Key: KAFKA-9497 > URL: https://issues.apache.org/jira/browse/KAFKA-9497 > Project: Kafka > Issue Type: Bug >Affects Versions: 0.10.2.2, 0.11.0.3, 1.1.1, 2.4.0 >Reporter: Rajini Sivaram >Assignee: Ron Dagostino >Priority: Major > Fix For: 2.7.0 > > > Note: This is not a regression, this has been the behaviour since SASL was > first implemented in Kafka. > > Sasl.createSaslServer and Sasl.createSaslClient may return null if a SASL > provider that works for the specified configs cannot be created. We don't > currently handle this case. As a result broker/client throws > NullPointerException if a provider has not been loaded. On the broker-side, > we allow brokers to start up successfully even if SASL provider for its > enabled mechanisms are not found. For SASL mechanisms > PLAIN/SCRAM-xx/OAUTHBEARER, the login module in Kafka loads the SASL > providers. If the login module is incorrectly configured, brokers startup and > then fail client connections when hitting NPE. Clients see disconnections > during authentication as a result. It is difficult to tell from the client or > broker logs why the failure occurred. We should fail during startup if SASL > providers are not found and provide better diagnostics for this case. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-9497) Brokers start up even if SASL provider is not loaded and throw NPE when clients connect
[ https://issues.apache.org/jira/browse/KAFKA-9497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram reassigned KAFKA-9497: - Assignee: Ron Dagostino (was: Rajini Sivaram) > Brokers start up even if SASL provider is not loaded and throw NPE when > clients connect > --- > > Key: KAFKA-9497 > URL: https://issues.apache.org/jira/browse/KAFKA-9497 > Project: Kafka > Issue Type: Bug >Affects Versions: 0.10.2.2, 0.11.0.3, 1.1.1, 2.4.0 >Reporter: Rajini Sivaram >Assignee: Ron Dagostino >Priority: Major > Fix For: 2.7.0 > > > Note: This is not a regression, this has been the behaviour since SASL was > first implemented in Kafka. > > Sasl.createSaslServer and Sasl.createSaslClient may return null if a SASL > provider that works for the specified configs cannot be created. We don't > currently handle this case. As a result broker/client throws > NullPointerException if a provider has not been loaded. On the broker-side, > we allow brokers to start up successfully even if SASL provider for its > enabled mechanisms are not found. For SASL mechanisms > PLAIN/SCRAM-xx/OAUTHBEARER, the login module in Kafka loads the SASL > providers. If the login module is incorrectly configured, brokers startup and > then fail client connections when hitting NPE. Clients see disconnections > during authentication as a result. It is difficult to tell from the client or > broker logs why the failure occurred. We should fail during startup if SASL > providers are not found and provide better diagnostics for this case. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
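The fail-fast check the issue proposes can be sketched directly against the JDK SASL API, which documents that Sasl.createSaslServer may return null when no loaded provider supports the mechanism. The wrapper class and error message below are hypothetical; only the null-return behaviour comes from the javax.security.sasl API.

```java
import javax.security.sasl.Sasl;
import javax.security.sasl.SaslException;
import javax.security.sasl.SaslServer;

// Illustrative sketch of the proposed startup check: turn a null return
// from Sasl.createSaslServer into a clear configuration error at broker
// startup, instead of a per-connection NullPointerException later.
public class SaslProviderCheck {
    public static SaslServer createOrFail(String mechanism) throws SaslException {
        // Protocol and server name are placeholders for this sketch.
        SaslServer server = Sasl.createSaslServer(
                mechanism, "kafka", "localhost", null, callbacks -> { });
        if (server == null)
            throw new SaslException("No SASL provider found for mechanism " + mechanism
                    + "; check that the login module for this mechanism is configured correctly");
        return server;
    }
}
```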
[jira] [Updated] (KAFKA-10338) Support PEM format for SSL certificates and private key
[ https://issues.apache.org/jira/browse/KAFKA-10338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram updated KAFKA-10338: --- Description: We currently support only file-based JKS/PKCS12 format for SSL key stores and trust stores. It will be good to add support for PEM as configuration values that fits better with config externalization. KIP: https://cwiki.apache.org/confluence/display/KAFKA/KIP-651+-+Support+PEM+format+for+SSL+certificates+and+private+key was:We currently support only file-based JKS/PKCS12 format for SSL key stores and trust stores. It will be good to add support for PEM as configuration values that fits better with config externalization. > Support PEM format for SSL certificates and private key > --- > > Key: KAFKA-10338 > URL: https://issues.apache.org/jira/browse/KAFKA-10338 > Project: Kafka > Issue Type: New Feature > Components: security >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > Fix For: 2.7.0 > > > We currently support only file-based JKS/PKCS12 format for SSL key stores and > trust stores. It will be good to add support for PEM as configuration values > that fits better with config externalization. > KIP: > https://cwiki.apache.org/confluence/display/KAFKA/KIP-651+-+Support+PEM+format+for+SSL+certificates+and+private+key -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-10338) Support PEM format for SSL certificates and private key
[ https://issues.apache.org/jira/browse/KAFKA-10338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram resolved KAFKA-10338. Fix Version/s: 2.7.0 Reviewer: Manikumar Resolution: Fixed > Support PEM format for SSL certificates and private key > --- > > Key: KAFKA-10338 > URL: https://issues.apache.org/jira/browse/KAFKA-10338 > Project: Kafka > Issue Type: New Feature > Components: security >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > Fix For: 2.7.0 > > > We currently support only file-based JKS/PKCS12 format for SSL key stores and > trust stores. It will be good to add support for PEM as configuration values > that fits better with config externalization. -- This message was sent by Atlassian Jira (v8.3.4#803005)
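As described in KIP-651 (linked in the description), the key and certificates can be supplied inline as configuration values rather than as keystore files, which fits config externalization. A sketch of such a configuration, with the PEM bodies elided:

```properties
# PEM key/trust material supplied directly as config values (KIP-651).
ssl.keystore.type=PEM
ssl.truststore.type=PEM
# PEM bodies elided below; the full blocks go inline as the config value.
ssl.keystore.key=-----BEGIN ENCRYPTED PRIVATE KEY----- ... -----END ENCRYPTED PRIVATE KEY-----
ssl.keystore.certificate.chain=-----BEGIN CERTIFICATE----- ... -----END CERTIFICATE-----
ssl.truststore.certificates=-----BEGIN CERTIFICATE----- ... -----END CERTIFICATE-----
```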
[jira] [Created] (KAFKA-10554) Perform follower truncation based on epoch offsets returned in Fetch response
Rajini Sivaram created KAFKA-10554: -- Summary: Perform follower truncation based on epoch offsets returned in Fetch response Key: KAFKA-10554 URL: https://issues.apache.org/jira/browse/KAFKA-10554 Project: Kafka Issue Type: Task Components: replication Reporter: Rajini Sivaram Assignee: Rajini Sivaram Fix For: 2.7.0 KAFKA-10435 updated fetch protocol for KIP-595 to return diverging epoch and offset as part of fetch response. We can use this to truncate logs in followers while processing fetch responses. -- This message was sent by Atlassian Jira (v8.3.4#803005)
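The truncation decision this enables can be sketched in simplified form. The class below is illustrative, not the replica fetcher's actual code: when the fetch response carries a diverging epoch and its end offset, the follower can truncate without a separate OffsetsForLeaderEpoch round trip.

```java
// Illustrative sketch: given the leader's end offset for the diverging
// epoch (returned in the fetch response per KIP-595), the follower
// truncates to min(its own log end offset, that end offset) and resumes
// fetching from there.
public class DivergingEpochTruncation {
    public static long truncationOffset(long followerLogEndOffset,
                                        long leaderEndOffsetForDivergingEpoch) {
        return Math.min(followerLogEndOffset, leaderEndOffsetForDivergingEpoch);
    }
}
```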
[jira] [Updated] (KAFKA-10520) InitProducerId may be blocked if least loaded node is not ready to send
[ https://issues.apache.org/jira/browse/KAFKA-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajini Sivaram updated KAFKA-10520: --- Description: From the logs of a failing producer that shows InitProducerId timing out after request timeout, it looks like we don't poll while waiting for transactional producer to be initialized and FindCoordinator request cannot be sent. The producer configuration used one bootstrap server and `max.in.flight.requests.per.connection=1`. The failing sequence: # Producer sends MetadataRequest to least loaded node (bootstrap server) # Producer is ready to send InitProducerId, needs to find transaction coordinator # Producer creates FindCoordinator request, but the only node known is the bootstrap server. Producer cannot send to this node since there is already the Metadata request in flight and max.inflight is 1. # Producer waits without polling, so Metadata response is not processed. InitProducerId times out eventually. We need to update the condition used to determine whether Sender should poll() to fix this issue. was: From the logs of a failing producer that shows InitProducerId timing out after request timeout, it looks like we don't poll while waiting for transactional producer to be initialized and FindCoordinator request cannot be sent. The producer configuration used one bootstrap server and `max.in.flight.requests.per.connection=1`. The failing sequence: # Producer sends MetadataRequest to least loaded node (bootstrap server) # Producer is ready to send InitProducerId, needs to find transaction coordinator # Producer creates FindCoordinator request, but the only node known is the bootstrap server. Producer cannot send to this node since there is already the Metadata request in flight and max.inflight is 1. # Producer waits without polling, so Metadata response is not processed. InitProducerId times out eventually. We need to update the condition used to determine whether Sender should poll() to fix this issue.
> InitProducerId may be blocked if least loaded node is not ready to send > --- > > Key: KAFKA-10520 > URL: https://issues.apache.org/jira/browse/KAFKA-10520 > Project: Kafka > Issue Type: Bug > Components: producer >Reporter: Rajini Sivaram >Priority: Major > Fix For: 2.7.0 > > > From the logs of a failing producer that shows InitProducerId timing out > after request timeout, it looks like we don't poll while waiting for > transactional producer to be initialized and FindCoordinator request cannot > be sent. The producer configuration used one bootstrap server and > `max.in.flight.requests.per.connection=1`. The failing sequence: > # Producer sends MetadataRequest to least loaded node (bootstrap server) > # Producer is ready to send InitProducerId, needs to find transaction > coordinator > # Producer creates FindCoordinator request, but the only node known is the > bootstrap server. Producer cannot send to this node since there is already > the Metadata request in flight and max.inflight is 1. > # Producer waits without polling, so Metadata response is not processed. > InitProducerId times out eventually. > > > We need to update the condition used to determine whether Sender should > poll() to fix this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-10520) InitProducerId may be blocked if least loaded node is not ready to send
Rajini Sivaram created KAFKA-10520: -- Summary: InitProducerId may be blocked if least loaded node is not ready to send Key: KAFKA-10520 URL: https://issues.apache.org/jira/browse/KAFKA-10520 Project: Kafka Issue Type: Bug Components: producer Reporter: Rajini Sivaram Fix For: 2.7.0 >From the logs of a failing producer that shows InitProducerId timing out after >request timeout, it looks like we don't poll while waiting for transactional >producer to be initialized and FindCoordinator request cannot be sent. The >producer configuration used one bootstrap server and >{color:#172b4d}`{color}{color:#067d17}{color:#172b4d}max.in.flight.requests.per.connection=1`. > The failing sequence:{color}{color} # {color:#067d17}{color:#172b4d}Producer sends MetadataRequest to least loaded node (bootstrap server){color} {color} # {color:#067d17}{color:#172b4d}Producer is ready to send InitProducerId, needs to find transaction coordinator{color}{color} # {color:#067d17}{color:#172b4d}Producer creates FindCoordinator request, but the only node known is the bootstrap server. Producer cannot send to this node since there is already the Metadata request in flight and max.inflight is 1.{color}{color} # {color:#067d17}{color:#172b4d}Producer waits without polling, so Metadata response is not processed. InitProducerId times out eventually.{color}{color} {color:#067d17}{color:#172b4d}We need to update the condition used to determine whether Sender should poll() to fix this issue.{color}{color} -- This message was sent by Atlassian Jira (v8.3.4#803005)