[jira] [Resolved] (KAFKA-15510) Follower's lastFetchedEpoch wrongly set when fetch response has no record

2023-09-28 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-15510.

Fix Version/s: 3.7.0
   Resolution: Fixed

> Follower's lastFetchedEpoch wrongly set when fetch response has no record
> -
>
> Key: KAFKA-15510
> URL: https://issues.apache.org/jira/browse/KAFKA-15510
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.6.0, 3.5.1
>Reporter: Chern Yih Cheah
>Assignee: Chern Yih Cheah
>Priority: Major
> Fix For: 3.7.0
>
>
> A regression was introduced by 
> [https://github.com/apache/kafka/pull/13843/files#diff-508e9dc4d52744119dda36d69ce63a1901abfd3080ca72fc4554250b7e9f5242]
> When the fetch response has no record for a partition, validBytes is 0. In 
> this case, we shouldn't set the last fetched epoch to 
> logAppendInfo.lastLeaderEpoch.asScala since there is no record and it is 
> Optional.empty. We should use currentFetchState.lastFetchedEpoch instead.
> One effect of this is that fetch truncation might not work correctly.
>  
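A minimal sketch of the epoch bookkeeping described above, written in Java for illustration (the actual follower fetcher logic lives in Scala; class and method names below are illustrative, not Kafka's API):

{code:java}
import java.util.Optional;

final class LastFetchedEpochSketch {
    // Hedged, illustrative only: when the fetch response carried no records
    // (validBytes == 0), the append info has no last leader epoch, so the follower
    // should keep the epoch recorded from the previous fetch instead of
    // overwriting it with Optional.empty.
    static Optional<Integer> nextLastFetchedEpoch(int validBytes,
                                                  Optional<Integer> appendInfoLastLeaderEpoch,
                                                  Optional<Integer> currentLastFetchedEpoch) {
        return validBytes > 0 ? appendInfoLastLeaderEpoch : currentLastFetchedEpoch;
    }
}
{code}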



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-15510) Follower's lastFetchedEpoch wrongly set when fetch response has no record

2023-09-28 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-15510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770051#comment-17770051
 ] 

Rajini Sivaram commented on KAFKA-15510:


This does not currently impact truncation in followers, but it seems useful to 
fix the regression to avoid breakage in the future if the timing scenario for 
truncation changes.

> Follower's lastFetchedEpoch wrongly set when fetch response has no record
> -
>
> Key: KAFKA-15510
> URL: https://issues.apache.org/jira/browse/KAFKA-15510
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.6.0, 3.5.1
>Reporter: Chern Yih Cheah
>Assignee: Chern Yih Cheah
>Priority: Major
>
> A regression was introduced by 
> [https://github.com/apache/kafka/pull/13843/files#diff-508e9dc4d52744119dda36d69ce63a1901abfd3080ca72fc4554250b7e9f5242]
> When the fetch response has no record for a partition, validBytes is 0. In 
> this case, we shouldn't set the last fetched epoch to 
> logAppendInfo.lastLeaderEpoch.asScala since there is no record and it is 
> Optional.empty. We should use currentFetchState.lastFetchedEpoch instead.
> One effect of this is that fetch truncation might not work correctly.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-15117) SslTransportLayerTest.testValidEndpointIdentificationCN fails with Java 20 & 21

2023-09-25 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-15117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-15117.

  Reviewer: Rajini Sivaram
Resolution: Fixed

> SslTransportLayerTest.testValidEndpointIdentificationCN fails with Java 20 & 
> 21
> ---
>
> Key: KAFKA-15117
> URL: https://issues.apache.org/jira/browse/KAFKA-15117
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ismael Juma
>Assignee: Purshotam Chauhan
>Priority: Major
> Fix For: 3.7.0
>
>
> All variations fail as seen below. These tests have been disabled when run 
> with Java 20 & 21 for now.
> {code:java}
> Gradle Test Run :clients:test > Gradle Test Executor 12 > 
> SslTransportLayerTest > testValidEndpointIdentificationCN(Args) > [1] 
> tlsProtocol=TLSv1.2, useInlinePem=false FAILED
>     org.opentest4j.AssertionFailedError: Channel 0 was not ready after 30 
> seconds ==> expected: <true> but was: <false>
>         at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>         at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>         at 
> app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
>         at 
> app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
>         at 
> app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:211)
>         at 
> app//org.apache.kafka.common.network.NetworkTestUtils.waitForChannelReady(NetworkTestUtils.java:107)
>         at 
> app//org.apache.kafka.common.network.NetworkTestUtils.checkClientConnection(NetworkTestUtils.java:70)
>         at 
> app//org.apache.kafka.common.network.SslTransportLayerTest.verifySslConfigs(SslTransportLayerTest.java:1296)
>         at 
> app//org.apache.kafka.common.network.SslTransportLayerTest.testValidEndpointIdentificationCN(SslTransportLayerTest.java:202)
> org.apache.kafka.common.network.SslTransportLayerTest.testValidEndpointIdentificationCN(Args)[2]
>  failed, log available in 
> /home/ijuma/src/kafka/clients/build/reports/testOutput/org.apache.kafka.common.network.SslTransportLayerTest.testValidEndpointIdentificationCN(Args)[2].test.stdout
> Gradle Test Run :clients:test > Gradle Test Executor 12 > 
> SslTransportLayerTest > testValidEndpointIdentificationCN(Args) > [2] 
> tlsProtocol=TLSv1.2, useInlinePem=true FAILED
>     org.opentest4j.AssertionFailedError: Channel 0 was not ready after 30 
> seconds ==> expected: <true> but was: <false>
>         at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>         at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>         at 
> app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
>         at 
> app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
>         at 
> app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:211)
>         at 
> app//org.apache.kafka.common.network.NetworkTestUtils.waitForChannelReady(NetworkTestUtils.java:107)
>         at 
> app//org.apache.kafka.common.network.NetworkTestUtils.checkClientConnection(NetworkTestUtils.java:70)
>         at 
> app//org.apache.kafka.common.network.SslTransportLayerTest.verifySslConfigs(SslTransportLayerTest.java:1296)
>         at 
> app//org.apache.kafka.common.network.SslTransportLayerTest.testValidEndpointIdentificationCN(SslTransportLayerTest.java:202)
> org.apache.kafka.common.network.SslTransportLayerTest.testValidEndpointIdentificationCN(Args)[3]
>  failed, log available in 
> /home/ijuma/src/kafka/clients/build/reports/testOutput/org.apache.kafka.common.network.SslTransportLayerTest.testValidEndpointIdentificationCN(Args)[3].test.stdout
> Gradle Test Run :clients:test > Gradle Test Executor 12 > 
> SslTransportLayerTest > testValidEndpointIdentificationCN(Args) > [3] 
> tlsProtocol=TLSv1.3, useInlinePem=false FAILED
>     org.opentest4j.AssertionFailedError: Channel 0 was not ready after 30 
> seconds ==> expected: <true> but was: <false>
>         at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>         at 
> app//org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>         at 
> app//org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
>         at 
> app//org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
>         at 
> app//org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:211)
>         at 
> app//org.apache.kafka.common.network.NetworkTestUtils.waitForChannelReady(NetworkTestUtils.java:107)
>         at 
> app//org.apache.kafka.common.network.NetworkTestUtils.checkClientConnection(NetworkTestUtils.java:70)
>  

[jira] [Resolved] (KAFKA-14891) Fix rack-aware range assignor to improve rack-awareness with co-partitioning

2023-04-12 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-14891.

Fix Version/s: 3.5.0
 Reviewer: David Jacot
   Resolution: Fixed

> Fix rack-aware range assignor to improve rack-awareness with co-partitioning
> 
>
> Key: KAFKA-14891
> URL: https://issues.apache.org/jira/browse/KAFKA-14891
> Project: Kafka
>  Issue Type: Bug
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 3.5.0
>
>
> We currently check all states for rack-aware assignment with co-partitioning 
> ([https://github.com/apache/kafka/blob/396536bb5aa1ba78c71ea824d736640b615bda8a/clients/src/main/java/org/apache/kafka/clients/consumer/RangeAssignor.java#L176]).
>  We should check each group of co-partitioned states separately so that we 
> can use rack-aware assignment with co-partitioning for subsets of topics.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14891) Fix rack-aware range assignor to improve rack-awareness with co-partitioning

2023-04-11 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-14891:
--

 Summary: Fix rack-aware range assignor to improve rack-awareness 
with co-partitioning
 Key: KAFKA-14891
 URL: https://issues.apache.org/jira/browse/KAFKA-14891
 Project: Kafka
  Issue Type: Bug
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram


We currently check all states for rack-aware assignment with co-partitioning 
([https://github.com/apache/kafka/blob/396536bb5aa1ba78c71ea824d736640b615bda8a/clients/src/main/java/org/apache/kafka/clients/consumer/RangeAssignor.java#L176]).
 We should check each group of co-partitioned states separately so that we can 
use rack-aware assignment with co-partitioning for subsets of topics.
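As a rough illustration of the intended behaviour, the sketch below (a hypothetical helper, not the RangeAssignor code) groups topics that are co-partitioned, i.e. that have the same partition count and the same set of subscribing consumers, so rack-aware placement can be applied per group:

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CoPartitionGroupingSketch {
    // Hedged, illustrative only: topics in the same group share a partition count and
    // subscriber set, so their partitions can be assigned together with rack matching
    // even when other topics in the subscription do not line up.
    static Collection<List<String>> coPartitionGroups(Map<String, Integer> partitionsPerTopic,
                                                      Map<String, Set<String>> consumersPerTopic) {
        Map<List<Object>, List<String>> groups = new HashMap<>();
        for (String topic : partitionsPerTopic.keySet()) {
            List<Object> key = Arrays.asList(partitionsPerTopic.get(topic), consumersPerTopic.get(topic));
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(topic);
        }
        return groups.values();
    }
}
{code}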



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14450) Rack-aware partition assignment for consumers (KIP-881)

2023-04-03 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-14450.

Resolution: Fixed

> Rack-aware partition assignment for consumers (KIP-881)
> ---
>
> Key: KAFKA-14450
> URL: https://issues.apache.org/jira/browse/KAFKA-14450
> Project: Kafka
>  Issue Type: New Feature
>  Components: consumer
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
>
> Top-level ticket for KIP-881 since we are splitting the PR into 3.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14452) Make sticky assignors rack-aware if consumer racks are configured.

2023-04-03 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-14452.

Fix Version/s: 3.5.0
 Reviewer: David Jacot
   Resolution: Fixed

> Make sticky assignors rack-aware if consumer racks are configured.
> --
>
> Key: KAFKA-14452
> URL: https://issues.apache.org/jira/browse/KAFKA-14452
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 3.5.0
>
>
> See KIP-881 for details



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14867) Trigger rebalance when replica racks change if client.rack is configured

2023-03-31 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-14867.

  Reviewer: David Jacot
Resolution: Fixed

> Trigger rebalance when replica racks change if client.rack is configured
> 
>
> Key: KAFKA-14867
> URL: https://issues.apache.org/jira/browse/KAFKA-14867
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 3.5.0
>
>
> To improve locality after reassignments, trigger rebalance from leader if set 
> of racks of partition replicas change.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14867) Trigger rebalance when replica racks change if client.rack is configured

2023-03-29 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-14867:
--

 Summary: Trigger rebalance when replica racks change if 
client.rack is configured
 Key: KAFKA-14867
 URL: https://issues.apache.org/jira/browse/KAFKA-14867
 Project: Kafka
  Issue Type: Sub-task
  Components: consumer
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram
 Fix For: 3.5.0


To improve locality after reassignments, trigger rebalance from leader if set 
of racks of partition replicas change.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14770) Allow dynamic keystore update for brokers if string representation of DN matches even if canonical DNs don't match

2023-03-07 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-14770.

  Reviewer: Manikumar
Resolution: Fixed

> Allow dynamic keystore update for brokers if string representation of DN 
> matches even if canonical DNs don't match
> --
>
> Key: KAFKA-14770
> URL: https://issues.apache.org/jira/browse/KAFKA-14770
> Project: Kafka
>  Issue Type: Improvement
>  Components: security
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 3.5.0
>
>
> To avoid mistakes during dynamic broker config updates that could potentially 
> affect clients, we restrict changes that can be performed dynamically without 
> broker restart. For broker keystore updates, we require the DN to be the same 
> for the old and new certificates since this could potentially contain host 
> names used for host name verification by clients. DNs are compared using the 
> standard Java implementation of X500Principal.equals(), which compares 
> canonical names. If a field's tag changes from one with a printable string 
> representation to one without, or vice versa, the canonical name check fails 
> even if the actual name is the same, since the canonical representation 
> converts only some tags to hex. We can relax the verification to allow dynamic updates in 
> this case by enabling dynamic update if either the canonical name or the 
> RFC2253 string representation of the DN matches.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14451) Make range assignor rack-aware if consumer racks are configured

2023-03-01 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-14451.

Fix Version/s: 3.5.0
 Reviewer: David Jacot
   Resolution: Fixed

> Make range assignor rack-aware if consumer racks are configured
> ---
>
> Key: KAFKA-14451
> URL: https://issues.apache.org/jira/browse/KAFKA-14451
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 3.5.0
>
>
> See KIP-881 for details



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14770) Allow dynamic keystore update for brokers if string representation of DN matches even if canonical DNs don't match

2023-03-01 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-14770:
--

 Summary: Allow dynamic keystore update for brokers if string 
representation of DN matches even if canonical DNs don't match
 Key: KAFKA-14770
 URL: https://issues.apache.org/jira/browse/KAFKA-14770
 Project: Kafka
  Issue Type: Improvement
  Components: security
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram
 Fix For: 3.5.0


To avoid mistakes during dynamic broker config updates that could potentially 
affect clients, we restrict changes that can be performed dynamically without 
broker restart. For broker keystore updates, we require the DN to be the same 
for the old and new certificates since this could potentially contain host 
names used for host name verification by clients. DNs are compared using the 
standard Java implementation of X500Principal.equals(), which compares canonical 
names. If a field's tag changes from one with a printable string representation 
to one without, or vice versa, the canonical name check fails even if the actual 
name is the same, since the canonical representation converts only some tags to 
hex. We can relax the verification to allow dynamic updates in this case by 
enabling dynamic update if either the canonical name or the RFC2253 string 
representation of the DN matches.
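A minimal sketch of the relaxed check described above, using the standard javax.security.auth.x500.X500Principal API (class and method names below are illustrative, not the broker's actual validation code):

{code:java}
import javax.security.auth.x500.X500Principal;

final class DnMatchSketch {
    // Accept the new keystore if either the canonical form (X500Principal.equals)
    // or the RFC 2253 string form of the DN matches the old certificate's DN.
    static boolean dnMatches(X500Principal oldDn, X500Principal newDn) {
        return oldDn.equals(newDn)
            || oldDn.getName(X500Principal.RFC2253).equals(newDn.getName(X500Principal.RFC2253));
    }
}
{code}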



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14676) Token endpoint URL used for OIDC cannot be set on the JAAS config

2023-02-12 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-14676.

Fix Version/s: 3.5.0
   3.4.1
   3.3.3
 Reviewer: Manikumar
   Resolution: Fixed

> Token endpoint URL used for OIDC cannot be set on the JAAS config
> -
>
> Key: KAFKA-14676
> URL: https://issues.apache.org/jira/browse/KAFKA-14676
> Project: Kafka
>  Issue Type: Bug
>  Components: security
>Affects Versions: 3.1.2, 3.4.0, 3.2.3, 3.3.2
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 3.5.0, 3.4.1, 3.3.3
>
>
> Kafka allows multiple clients within a JVM to use different SASL 
> configurations by configuring the JAAS configuration in `sasl.jaas.config` 
> instead of the JVM-wide system property. For SASL login, we reuse logins 
> within a JVM by caching logins indexed by their sasl.jaas.config. This relies 
> on login configs being overridable using `sasl.jaas.config`. 
> KIP-768 
> ([https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=186877575])
>  added support for OIDC for SASL/OAUTHBEARER. The token endpoint used to 
> acquire tokens can currently only be configured using the Kafka config 
> `sasl.oauthbearer.token.endpoint.url`. This prevents different clients within 
> a JVM from using different URLs. We need to either provide a way to override 
> the URL within `sasl.jaas.config` or include more of the client configs in 
> the LoginMetadata used as key for cached logins.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14676) Token endpoint URL used for OIDC cannot be set on the JAAS config

2023-02-06 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-14676:
--

 Summary: Token endpoint URL used for OIDC cannot be set on the 
JAAS config
 Key: KAFKA-14676
 URL: https://issues.apache.org/jira/browse/KAFKA-14676
 Project: Kafka
  Issue Type: Bug
  Components: security
Affects Versions: 3.3.2, 3.2.3, 3.1.2, 3.4.0
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram


Kafka allows multiple clients within a JVM to use different SASL configurations 
by configuring the JAAS configuration in `sasl.jaas.config` instead of the 
JVM-wide system property. For SASL login, we reuse logins within a JVM by 
caching logins indexed by their sasl.jaas.config. This relies on login configs 
being overridable using `sasl.jaas.config`. 

KIP-768 
([https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=186877575]) 
added support for OIDC for SASL/OAUTHBEARER. The token endpoint used to acquire 
tokens can currently only be configured using the Kafka config 
`sasl.oauthbearer.token.endpoint.url`. This prevents different clients within a 
JVM from using different URLs. We need to either provide a way to override the 
URL within `sasl.jaas.config` or include more of the client configs in the 
LoginMetadata used as key for cached logins.
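For reference, a hedged sketch of the configuration shape involved (all values are placeholders): the token endpoint is a Kafka-level config rather than a JAAS option, which is why two clients in the same JVM could not point at different endpoints before this fix.

{code:java}
import java.util.Properties;

final class OauthClientConfigSketch {
    static Properties clientProps() {
        Properties props = new Properties();
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "OAUTHBEARER");
        // Kafka-level config (not part of sasl.jaas.config), so it was shared across
        // clients whose cached logins had the same JAAS config.
        props.put("sasl.oauthbearer.token.endpoint.url", "https://issuer.example/token");
        // JAAS config, which is also the key used for the client login cache.
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule required "
                + "clientId=\"client-a\" clientSecret=\"secret\";");
        return props;
    }
}
{code}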



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-14352) Support rack-aware partition assignment for Kafka consumers

2022-12-08 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-14352.

  Reviewer: David Jacot
Resolution: Fixed

This includes protocol changes for rack-aware assignment. Default assignors 
will be made rack-aware under follow-on tickets in the next release.

> Support rack-aware partition assignment for Kafka consumers
> ---
>
> Key: KAFKA-14352
> URL: https://issues.apache.org/jira/browse/KAFKA-14352
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 3.4.0
>
>
> KIP-392 added support for consumers to fetch from the replica in their local 
> rack. To benefit from locality, consumers need to be assigned partitions 
> which have a replica in the same rack. This works well when replication 
> factor is the same as the number of racks, since every rack would then have a 
> replica with rack-aware replica assignment. If the number of racks is higher, 
> some racks may not have replicas of some partitions and hence consumers in 
> these racks will have to fetch from another rack. It will be useful to 
> propagate rack in the subscription metadata for consumers and provide a 
> rack-aware partition assignor for consumers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14452) Make sticky assignors rack-aware if consumer racks are configured.

2022-12-07 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-14452:
--

 Summary: Make sticky assignors rack-aware if consumer racks are 
configured.
 Key: KAFKA-14452
 URL: https://issues.apache.org/jira/browse/KAFKA-14452
 Project: Kafka
  Issue Type: Sub-task
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram


See KIP-881 for details



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14451) Make range assignor rack-aware if consumer racks are configured

2022-12-07 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-14451:
--

 Summary: Make range assignor rack-aware if consumer racks are 
configured
 Key: KAFKA-14451
 URL: https://issues.apache.org/jira/browse/KAFKA-14451
 Project: Kafka
  Issue Type: Sub-task
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram


See KIP-881 for details



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14352) Support rack-aware partition assignment for Kafka consumers

2022-12-07 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram updated KAFKA-14352:
---
Parent: KAFKA-14450
Issue Type: Sub-task  (was: New Feature)

> Support rack-aware partition assignment for Kafka consumers
> ---
>
> Key: KAFKA-14352
> URL: https://issues.apache.org/jira/browse/KAFKA-14352
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 3.4.0
>
>
> KIP-392 added support for consumers to fetch from the replica in their local 
> rack. To benefit from locality, consumers need to be assigned partitions 
> which have a replica in the same rack. This works well when replication 
> factor is the same as the number of racks, since every rack would then have a 
> replica with rack-aware replica assignment. If the number of racks is higher, 
> some racks may not have replicas of some partitions and hence consumers in 
> these racks will have to fetch from another rack. It will be useful to 
> propagate rack in the subscription metadata for consumers and provide a 
> rack-aware partition assignor for consumers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14450) Rack-aware partition assignment for consumers (KIP-881)

2022-12-07 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-14450:
--

 Summary: Rack-aware partition assignment for consumers (KIP-881)
 Key: KAFKA-14450
 URL: https://issues.apache.org/jira/browse/KAFKA-14450
 Project: Kafka
  Issue Type: New Feature
  Components: consumer
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram


Top-level ticket for KIP-881 since we are splitting the PR into 3.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14352) Support rack-aware partition assignment for Kafka consumers

2022-11-02 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-14352:
--

 Summary: Support rack-aware partition assignment for Kafka 
consumers
 Key: KAFKA-14352
 URL: https://issues.apache.org/jira/browse/KAFKA-14352
 Project: Kafka
  Issue Type: New Feature
  Components: consumer
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram
 Fix For: 3.4.0


KIP-392 added support for consumers to fetch from the replica in their local 
rack. To benefit from locality, consumers need to be assigned partitions which 
have a replica in the same rack. This works well when replication factor is the 
same as the number of racks, since every rack would then have a replica with 
rack-aware replica assignment. If the number of racks is higher, some racks may 
not have replicas of some partitions and hence consumers in these racks will 
have to fetch from another rack. It will be useful to propagate rack in the 
subscription metadata for consumers and provide a rack-aware partition assignor 
for consumers.
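For context, a hedged sketch of the two sides of the KIP-392 locality that this ticket builds on (values are placeholders): the broker advertises its rack and uses a rack-aware replica selector, and the consumer declares its rack via client.rack.

{code:java}
import java.util.Properties;

final class RackAwareConsumerConfigSketch {
    // Broker side (server.properties), shown here as comments:
    //   broker.rack=us-east-1a
    //   replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
    static Properties consumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1.example:9092"); // placeholder address
        props.put("group.id", "rack-aware-demo");               // placeholder group
        props.put("client.rack", "us-east-1a");                 // consumer's rack, matched against replica racks
        return props;
    }
}
{code}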



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (KAFKA-13559) The broker's ProduceResponse may be delayed for 300ms

2022-08-15 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-13559.

Fix Version/s: 3.4.0
 Reviewer: Rajini Sivaram
   Resolution: Fixed

> The broker's  ProduceResponse may be delayed for 300ms
> --
>
> Key: KAFKA-13559
> URL: https://issues.apache.org/jira/browse/KAFKA-13559
> Project: Kafka
>  Issue Type: Task
>  Components: core
>Affects Versions: 2.7.0
>Reporter: frankshi
>Assignee: Badai Aqrandista
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: image-1.png, image-2.png, 
> image-2021-12-21-14-44-56-689.png, image-2021-12-21-14-45-22-716.png, 
> image-3.png, image-5.png, image-6.png, image-7.png, image.png
>
>
> Hi team:
> We have found that the value in the source code 
> [here|https://github.com/apache/kafka/blob/2.7/core/src/main/scala/kafka/network/SocketServer.scala#L922]
>  may cause the broker's ProduceResponse to be delayed by 300ms.
>  * Server-version: 2.13-2.7.0.
>  * Client-version: confluent-kafka-python-1.5.0.
> We have set the client's configuration as follows:
> {code:java}
> linger.ms = 0
> acks = 1
> delivery.timeout.ms = 100
> request.timeout.ms = 80
> sasl.mechanism = "PLAIN"
> security.protocol = "SASL_SSL"
> ..
> {code}
> Because we set acks=1, the client sends ProduceRequests and receives 
> ProduceResponses from brokers. The leader broker doesn't need to wait for the 
> ISR replicas to write the data to disk; it can reply to the client by 
> sending ProduceResponses directly. In our situation, the ping time between 
> the client and the Kafka brokers is about ~10ms, and most of the time, the 
> responses are received about 10ms after the Produce requests are sent. But 
> sometimes the responses are received about ~300ms later.
> The following shows the log from the client.
> {code:java}
> 2021-11-26 02:31:30,567  Sent partial ProduceRequest (v7, 0+16527/37366 
> bytes, CorrId 2753)
> 2021-11-26 02:31:30,568  Sent partial ProduceRequest (v7, 16527+16384/37366 
> bytes, CorrId 2753)
> 2021-11-26 02:31:30,568  Sent ProduceRequest (v7, 37366 bytes @ 32911, CorrId 
> 2753)
> 2021-11-26 02:31:30,570  Sent ProduceRequest (v7, 4714 bytes @ 0, CorrId 2754)
> 2021-11-26 02:31:30,571  Sent ProduceRequest (v7, 1161 bytes @ 0, CorrId 2755)
> 2021-11-26 02:31:30,572  Sent ProduceRequest (v7, 1240 bytes @ 0, CorrId 2756)
> 2021-11-26 02:31:30,572  Received ProduceResponse (v7, 69 bytes, CorrId 2751, 
> rtt 9.79ms)
> 2021-11-26 02:31:30,572  Received ProduceResponse (v7, 69 bytes, CorrId 2752, 
> rtt 10.34ms)
> 2021-11-26 02:31:30,573  Received ProduceResponse (v7, 69 bytes, CorrId 2753, 
> rtt 10.11ms)
> 2021-11-26 02:31:30,872  Received ProduceResponse (v7, 69 bytes, CorrId 2754, 
> rtt 309.69ms)
> 2021-11-26 02:31:30,883  Sent ProduceRequest (v7, 1818 bytes @ 0, CorrId 2757)
> 2021-11-26 02:31:30,887  Sent ProduceRequest (v7, 1655 bytes @ 0, CorrId 2758)
> 2021-11-26 02:31:30,888  Received ProduceResponse (v7, 69 bytes, CorrId 2755, 
> rtt 318.85ms)
> 2021-11-26 02:31:30,893  Sent partial ProduceRequest (v7, 0+16527/37562 
> bytes, CorrId 2759)
> 2021-11-26 02:31:30,894  Sent partial ProduceRequest (v7, 16527+16384/37562 
> bytes, CorrId 2759)
> 2021-11-26 02:31:30,895  Sent ProduceRequest (v7, 37562 bytes @ 32911, CorrId 
> 2759)
> 2021-11-26 02:31:30,896  Sent ProduceRequest (v7, 4700 bytes @ 0, CorrId 2760)
> 2021-11-26 02:31:30,897  Received ProduceResponse (v7, 69 bytes, CorrId 2756, 
> rtt 317.74ms)
> 2021-11-26 02:31:30,897  Received ProduceResponse (v7, 69 bytes, CorrId 2757, 
> rtt 4.22ms)
> 2021-11-26 02:31:30,899  Received ProduceResponse (v7, 69 bytes, CorrId 2758, 
> rtt 2.61ms){code}
>  
> The requests with CorrId 2753 and 2754 are sent almost at the same time, but 
> the response for 2754 is delayed by ~300ms. 
> We checked the logs on the broker.
>  
> {code:java}
> [2021-11-26 02:31:30,873] DEBUG Completed 
> request:RequestHeader(apiKey=PRODUCE, apiVersion=7, clientId=rdkafka, 
> correlationId=2754) – {acks=1,timeout=80,numPartitions=1},response: 
> {responses=[\{topic=***,partition_responses=[{partition=32,error_code=0,base_offset=58625,log_append_time=-1,log_start_offset=49773}]}
> ],throttle_time_ms=0} from connection 
> 10.10.44.59:9093-10.10.0.68:31183-66;totalTime:0.852,requestQueueTime:0.128,localTime:0.427,remoteTime:0.09,throttleTime:0,responseQueueTime:0.073,sendTime:0.131,securityProtocol:SASL_SSL,principal:User:***,listener:SASL_SSL,clientInformation:ClientInformation(softwareName=confluent-kafka-python,
>  softwareVersion=1.5.0-rdkafka-1.5.2) (kafka.request.logger)
> {code}
>  
>  
> It seems that the time cost on the server side is very small. What’s the 
> reason for the latency spikes?
> We also did tcpdump  at the server side and 

[jira] [Resolved] (KAFKA-13539) Improve propagation and processing of SSL handshake failures

2021-12-14 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-13539.

  Reviewer: Manikumar
Resolution: Fixed

> Improve propagation and processing of SSL handshake failures
> 
>
> Key: KAFKA-13539
> URL: https://issues.apache.org/jira/browse/KAFKA-13539
> Project: Kafka
>  Issue Type: Bug
>  Components: security
>Affects Versions: 3.1.0
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 3.2.0
>
>
> When the server fails an SSL handshake and closes its connection, we 
> attempt to report this to clients on a best-effort basis. However, our tests 
> assume that the peer always detects the failure. This may not be the case when 
> there are delays. It will be good to improve the reliability of handshake 
> failure reporting.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (KAFKA-13539) Improve propagation and processing of SSL handshake failures

2021-12-13 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-13539:
--

 Summary: Improve propagation and processing of SSL handshake 
failures
 Key: KAFKA-13539
 URL: https://issues.apache.org/jira/browse/KAFKA-13539
 Project: Kafka
  Issue Type: Bug
  Components: security
Affects Versions: 3.1.0
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram
 Fix For: 3.2.0


When the server fails an SSL handshake and closes its connection, we 
attempt to report this to clients on a best-effort basis. However, our tests 
assume that the peer always detects the failure. This may not be the case when 
there are delays. It will be good to improve the reliability of handshake failure 
reporting.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (KAFKA-13461) KafkaController stops functioning as active controller after ZooKeeperClient auth failure

2021-12-02 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-13461.

Fix Version/s: 3.1.0
   3.0.1
 Reviewer: Jun Rao
   Resolution: Fixed

> KafkaController stops functioning as active controller after ZooKeeperClient 
> auth failure
> -
>
> Key: KAFKA-13461
> URL: https://issues.apache.org/jira/browse/KAFKA-13461
> Project: Kafka
>  Issue Type: Bug
>  Components: zkclient
>Reporter: Vincent Jiang
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 3.1.0, 3.0.1
>
>
> When java.security.auth.login.config is present but there is no "Client" 
> section, ZookeeperSaslClient creation fails and raises a LoginException, 
> resulting in a warning log:
> {code:java}
> WARN SASL configuration failed: javax.security.auth.login.LoginException: No 
> JAAS configuration section named 'Client' was found in specified JAAS 
> configuration file: '***'. Will continue connection to Zookeeper server 
> without SASL authentication, if Zookeeper server allows it.{code}
> When this happens after initial startup, ClientCnxn enqueues an AuthFailed 
> event, which triggers the following sequence:
>  # zkclient reinitialization is triggered
>  # the controller resigns.
>  # Before the controller's ZK session expires, the controller successfully 
> connects to ZK and maintains the current session
>  # In KafkaController.elect(), the controller sets activeControllerId to 
> itself and short-circuits the rest of the elect. Since the controller 
> resigned earlier and also skips the call to onControllerFailover(), the 
> controller is not actually functioning as the active controller (e.g. the 
> necessary ZK watchers haven't been registered).
>  
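For reference, a hedged sketch of the kind of "Client" section ZookeeperSaslClient looks for in the JAAS file (login module and credentials are illustrative, assuming ZooKeeper DIGEST-MD5 authentication):

{code:java}
// kafka_server_jaas.conf (sketch): without a "Client" section like this, the broker's
// ZooKeeper client logs the SASL warning above and falls back to an unauthenticated
// connection, which sets up the AuthFailed event path described in the ticket.
Client {
  org.apache.zookeeper.server.auth.DigestLoginModule required
  username="zk-client"
  password="zk-client-secret";
};
{code}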



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (KAFKA-13405) Kafka Streams restore-consumer fails to refresh broker IPs after upgrading Kafka cluster

2021-11-19 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17446375#comment-17446375
 ] 

Rajini Sivaram commented on KAFKA-13405:


I am not familiar with the specific restore consumer referred to in the ticket, 
but Kafka producers and consumers resolve bootstrap servers when they start up. 
All the matching addresses are cached, so some changes are ok. Clients which 
are connected to brokers receive metadata from brokers and will continue to 
work with a rolling upgrade that changes all IPs since those get re-resolved. 
But if a client started up with bootstrap servers that resolved to old IPs and 
attempts to get metadata using bootstrap servers after all IPs have changed, 
that wouldn't work because we don't resolve bootstrap servers again. That looks 
like the scenario with the restore consumer.
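A hedged sketch of a client-side mitigation consistent with the comment above (host name and values are placeholders): point bootstrap.servers at a stable DNS name instead of broker IPs, and resolve all of its addresses, so bootstrap remains usable even after every broker IP changes.

{code:java}
import java.util.Properties;

final class BootstrapConfigSketch {
    static Properties clientProps() {
        Properties props = new Properties();
        // Resolved once at client start-up; a stable name survives broker IP churn.
        props.put("bootstrap.servers", "kafka-bootstrap.example.internal:9092");
        // Resolve and use every address behind the bootstrap name.
        props.put("client.dns.lookup", "use_all_dns_ips");
        return props;
    }
}
{code}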

> Kafka Streams restore-consumer fails to refresh broker IPs after upgrading 
> Kafka cluster
> 
>
> Key: KAFKA-13405
> URL: https://issues.apache.org/jira/browse/KAFKA-13405
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.8.0, 2.7.1
>Reporter: Daniel O'Halloran
>Priority: Critical
> Attachments: KafkaConfig.txt, KafkaLogs.txt
>
>
> *+Description+*
> After upgrading our Kafka clusters from 2.7 to 2.8 the Streams 
> restore-consumers never update their broker IPs.
> The applications continue to process events as normal, until there is a 
> rebalance.
> Once a rebalance occurs the restore consumers attempts to connect to the old 
> brokers IPs indefinitely and the streams tasks never go back into a RUNNING 
> state.
> We were able to replicate this behaviour with kafka-streams client libraries 
> 2.5.1, 2.7.1 and 2.8.0
>  
> *+Steps to reproduce+*
>  # Upgrade brokers from Kafka 2.7 to Kafka 2.8
>  # Ensure old brokers are completely shut down
>  # Trigger a rebalance of a streams application
>  
> *+Expected result+*
>  * Application rebalances as normal
>  
> *+Actual Result+*
>  * Application cannot restore its data
>  * restore consumer tries to connect to old brokers indefinitely
>  
> *+Observations+*
>  * The cluster metadata was updated on all stream consumer threads during the 
> broker upgrade (multiple times) as the new brokers were brought online 
> (corresponding to leader election occurring on the subscribed partitions), 
> however no cluster metadata was logged on the restore-consumer thread.
>  * None of the original broker IPs are valid/accessible after the upgrade (as 
> expected)
>  * No partition rebalance occurs during the kafka upgrade process.
>  * When the first re-balance was triggered after upgrade, the 
> restore-consumer loops failing to connect on each of the 3 original IPs, but 
> none of the new broker IPs.
>  
> *+Workaround+*
>  * Restart our applications after upgrading our Kafka cluster



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (KAFKA-13422) Even if the correct username and password are configured, when ClientBroker or KafkaClient tries to establish a SASL connection to ServerBroker, an exception is thrown

2021-11-16 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444517#comment-17444517
 ] 

Rajini Sivaram commented on KAFKA-13422:


Yes, I remember now, we do take the first entry from Subject in the default 
callback handlers, so the current behaviour isn't great. But I am still 
unconvinced about removing support for a case that has always existed and can 
be made to work as required when implementing custom login modules and callback 
handlers. When we added `sasl.jaas.config` later, we chose to restrict to one 
login module to avoid the issues that existed with the JAAS file 
implementation. And now that we recommend the properties option anyway, I am 
not sure of the value of failing early in the JAAS file case. Perhaps logging 
an error instead of failing altogether would be better? [~ijuma] What do you 
think?

> Even if the correct username and password are configured, when ClientBroker 
> or KafkaClient tries to establish a SASL connection to ServerBroker, an 
> exception is thrown: (Authentication failed: Invalid username or password)
> --
>
> Key: KAFKA-13422
> URL: https://issues.apache.org/jira/browse/KAFKA-13422
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, core
>Affects Versions: 2.7.1, 3.0.0
>Reporter: RivenSun
>Priority: Major
> Attachments: CustomerAuthCallbackHandler.java, 
> LoginContext_login_debug.png, SaslClientCallbackHandler_handle_debug.png
>
>
>  
> h1. Foreword:
> While deploying a Kafka cluster on a newer version (2.7.1), I encountered 
> authentication failures on inter-broker connections. The problem can also be 
> reproduced on the current latest version, 3.0.0.
> h1. Problem reproduction:
> h2. 1) Broker version is 3.0.0
> h3. The kafka_server_jaas.conf of every broker is identical, with the 
> following content:
>  
>  
> {code:java}
> KafkaServer {
>   org.apache.kafka.common.security.plain.PlainLoginModule required
>   username="admin"
>   password="kJTVDziatPgjXG82sFHc4O1EIuewmlvS"
>   user_admin="kJTVDziatPgjXG82sFHc4O1EIuewmlvS"
>   user_alice="alice";
>   org.apache.kafka.common.security.scram.ScramLoginModule required
>   username="admin_scram"
>   password="admin_scram_password";
>  
> };
> {code}
>  
>  
> h3. broker server.properties:
> One broker's configuration file is shown below; the other brokers' files 
> differ only in the localPublicIp used in advertised.listeners.
>  
> {code:java}
> broker.id=1
> broker.rack=us-east-1a
> advertised.listeners=SASL_PLAINTEXT://localPublicIp:9779,SASL_SSL://localPublicIp:9889,INTERNAL_SSL://:9009,PLAIN_PLUGIN_SSL://localPublicIp:9669
> log.dirs=/asyncmq/kafka/data_1,/asyncmq/kafka/data_2
> zookeeper.connect=***
> listeners=SASL_PLAINTEXT://:9779,SASL_SSL://:9889,INTERNAL_SSL://:9009,PLAIN_PLUGIN_SSL://:9669
> listener.security.protocol.map=INTERNAL_SSL:SASL_SSL,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL,PLAIN_PLUGIN_SSL:SASL_SSL
> listener.name.plain_plugin_ssl.plain.sasl.server.callback.handler.class=org.apache.kafka.common.security.plain.internals.PlainServerCallbackHandler
> #ssl config
> ssl.keystore.password=***
> ssl.key.password=***
> ssl.truststore.password=***
> ssl.keystore.location=***
> ssl.truststore.location=***
> ssl.client.auth=none
> ssl.endpoint.identification.algorithm=
> #broker communicate config
> #security.inter.broker.protocol=SASL_PLAINTEXT
> inter.broker.listener.name=INTERNAL_SSL
> sasl.mechanism.inter.broker.protocol=PLAIN
> #sasl authentication config
> sasl.kerberos.service.name=kafka
> sasl.enabled.mechanisms=PLAIN,SCRAM-SHA-256,SCRAM-SHA-512,GSSAPI
> delegation.token.master.key=***
> delegation.token.expiry.time.ms=8640
> delegation.token.max.lifetime.ms=31536
> {code}
>  
>  
> Then all brokers are started at the same time. Each broker actually starts 
> successfully, but authentication consistently fails when the controller node 
> connects to the other brokers. Inter-broker connections cannot be established, 
> so the entire Kafka cluster is unable to serve external requests.
> h3. The server log keeps printing errors:
> The brokers' real IPs in the log are sensitive, so ** is used in their 
> place here
>  
> {code:java}
> [2021-10-29 14:16:19,831] INFO [SocketServer listenerType=ZK_BROKER, 
> nodeId=3] Started socket server acceptors and processors 
> (kafka.network.SocketServer)
> [2021-10-29 14:16:19,836] INFO Kafka version: 3.0.0 
> 

[jira] [Commented] (KAFKA-13422) Even if the correct username and password are configured, when ClientBroker or KafkaClient tries to establish a SASL connection to ServerBroker, an exception is thrown

2021-11-16 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1738#comment-1738
 ] 

Rajini Sivaram commented on KAFKA-13422:


[~RivenSun] We should have probably added that check earlier. But changing it 
now from picking the first one to disallowing multiple may be a breaking 
change. We can log an error instead of throwing exception if that helps.

> Even if the correct username and password are configured, when ClientBroker 
> or KafkaClient tries to establish a SASL connection to ServerBroker, an 
> exception is thrown: (Authentication failed: Invalid username or password)
> --
>
> Key: KAFKA-13422
> URL: https://issues.apache.org/jira/browse/KAFKA-13422
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, core
>Affects Versions: 2.7.1, 3.0.0
>Reporter: RivenSun
>Priority: Major
> Attachments: CustomerAuthCallbackHandler.java, 
> LoginContext_login_debug.png, SaslClientCallbackHandler_handle_debug.png
>
>
>  
> h1. Foreword:
> While deploying a Kafka cluster on a newer version (2.7.1), I encountered 
> authentication failures on inter-broker connections. The problem can also be 
> reproduced on the current latest version, 3.0.0.
> h1. Problem reproduction:
> h2. 1) Broker version is 3.0.0
> h3. The kafka_server_jaas.conf of every broker is identical, with the 
> following content:
>  
>  
> {code:java}
> KafkaServer {
>   org.apache.kafka.common.security.plain.PlainLoginModule required
>   username="admin"
>   password="kJTVDziatPgjXG82sFHc4O1EIuewmlvS"
>   user_admin="kJTVDziatPgjXG82sFHc4O1EIuewmlvS"
>   user_alice="alice";
>   org.apache.kafka.common.security.scram.ScramLoginModule required
>   username="admin_scram"
>   password="admin_scram_password";
>  
> };
> {code}
>  
>  
> h3. broker server.properties:
> One broker's configuration file is shown below; the other brokers' files 
> differ only in the localPublicIp used in advertised.listeners.
>  
> {code:java}
> broker.id=1
> broker.rack=us-east-1a
> advertised.listeners=SASL_PLAINTEXT://localPublicIp:9779,SASL_SSL://localPublicIp:9889,INTERNAL_SSL://:9009,PLAIN_PLUGIN_SSL://localPublicIp:9669
> log.dirs=/asyncmq/kafka/data_1,/asyncmq/kafka/data_2
> zookeeper.connect=***
> listeners=SASL_PLAINTEXT://:9779,SASL_SSL://:9889,INTERNAL_SSL://:9009,PLAIN_PLUGIN_SSL://:9669
> listener.security.protocol.map=INTERNAL_SSL:SASL_SSL,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL,PLAIN_PLUGIN_SSL:SASL_SSL
> listener.name.plain_plugin_ssl.plain.sasl.server.callback.handler.class=org.apache.kafka.common.security.plain.internals.PlainServerCallbackHandler
> #ssl config
> ssl.keystore.password=***
> ssl.key.password=***
> ssl.truststore.password=***
> ssl.keystore.location=***
> ssl.truststore.location=***
> ssl.client.auth=none
> ssl.endpoint.identification.algorithm=
> #broker communicate config
> #security.inter.broker.protocol=SASL_PLAINTEXT
> inter.broker.listener.name=INTERNAL_SSL
> sasl.mechanism.inter.broker.protocol=PLAIN
> #sasl authentication config
> sasl.kerberos.service.name=kafka
> sasl.enabled.mechanisms=PLAIN,SCRAM-SHA-256,SCRAM-SHA-512,GSSAPI
> delegation.token.master.key=***
> delegation.token.expiry.time.ms=8640
> delegation.token.max.lifetime.ms=31536
> {code}
>  
>  
> Then all brokers are started at the same time. Each broker actually starts 
> successfully, but authentication consistently fails when the controller node 
> connects to the other brokers. Inter-broker connections cannot be established, 
> so the entire Kafka cluster is unable to serve external requests.
> h3. The server log keeps printing errors:
> The brokers' real IPs in the log are sensitive, so ** is used in their 
> place here
>  
> {code:java}
> [2021-10-29 14:16:19,831] INFO [SocketServer listenerType=ZK_BROKER, 
> nodeId=3] Started socket server acceptors and processors 
> (kafka.network.SocketServer)
> [2021-10-29 14:16:19,836] INFO Kafka version: 3.0.0 
> (org.apache.kafka.common.utils.AppInfoParser)
> [2021-10-29 14:16:19,836] INFO Kafka commitId: 8cb0a5e9d3441962 
> (org.apache.kafka.common.utils.AppInfoParser)
> [2021-10-29 14:16:19,836] INFO Kafka startTimeMs: 1635516979831 
> (org.apache.kafka.common.utils.AppInfoParser)
> [2021-10-29 14:16:19,837] INFO [KafkaServer id=3] started 
> (kafka.server.KafkaServer)
> [2021-10-29 14:16:20,249] INFO [SocketServer listenerType=ZK_BROKER, 
> nodeId=3] Failed authentication 

[jira] [Commented] (KAFKA-13422) Even if the correct username and password are configured, when ClientBroker or KafkaClient tries to establish a SASL connection to ServerBroker, an exception is thrown

2021-11-16 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1710#comment-1710
 ] 

Rajini Sivaram commented on KAFKA-13422:


[~RivenSun] Producers, consumers and other clients should be provided a JAAS 
configuration with a single login module, because they only ever use a single 
mechanism. We haven't supported the case where clients can choose between 
multiple logins either through the JAAS config file or the broker property.

Kafka's SASL integration is highly configurable. So if you do have a use case 
where you need to store a range of credentials in one JAAS file and have Kafka 
use one of these, you can write your own login callback handler and configure 
`sasl.login.callback.handler.class`. Then you can introduce your own logic 
based on some options to choose the appropriate credentials.
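A hedged sketch of such a custom handler (class name, option names, and the choice of mechanism are illustrative; depending on the mechanism it would be wired in via sasl.login.callback.handler.class or sasl.client.callback.handler.class):

{code:java}
import java.util.List;
import java.util.Map;
import javax.security.auth.callback.Callback;
import javax.security.auth.callback.NameCallback;
import javax.security.auth.callback.PasswordCallback;
import javax.security.auth.callback.UnsupportedCallbackException;
import javax.security.auth.login.AppConfigurationEntry;
import org.apache.kafka.common.security.auth.AuthenticateCallbackHandler;

public class SelectingPlainCallbackHandler implements AuthenticateCallbackHandler {
    private String username;
    private String password;

    @Override
    public void configure(Map<String, ?> configs, String saslMechanism,
                          List<AppConfigurationEntry> jaasConfigEntries) {
        // Custom selection logic goes here; this sketch simply reads two illustrative
        // options from the first JAAS entry instead of choosing among many credentials.
        Map<String, ?> options = jaasConfigEntries.get(0).getOptions();
        this.username = (String) options.get("username");
        this.password = (String) options.get("password");
    }

    @Override
    public void handle(Callback[] callbacks) throws UnsupportedCallbackException {
        for (Callback callback : callbacks) {
            if (callback instanceof NameCallback)
                ((NameCallback) callback).setName(username);
            else if (callback instanceof PasswordCallback)
                ((PasswordCallback) callback).setPassword(password.toCharArray());
            else
                throw new UnsupportedCallbackException(callback);
        }
    }

    @Override
    public void close() {
    }
}
{code}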

> Even if the correct username and password are configured, when ClientBroker 
> or KafkaClient tries to establish a SASL connection to ServerBroker, an 
> exception is thrown: (Authentication failed: Invalid username or password)
> --
>
> Key: KAFKA-13422
> URL: https://issues.apache.org/jira/browse/KAFKA-13422
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, core
>Affects Versions: 2.7.1, 3.0.0
>Reporter: RivenSun
>Priority: Major
> Attachments: CustomerAuthCallbackHandler.java, 
> LoginContext_login_debug.png, SaslClientCallbackHandler_handle_debug.png
>
>
>  
> h1. Foreword:
> While deploying a Kafka cluster on a newer version (2.7.1), I encountered 
> authentication failures on inter-broker connections. The problem can also be 
> reproduced on the current latest version, 3.0.0.
> h1. Problem reproduction:
> h2. 1) Broker version is 3.0.0
> h3. The kafka_server_jaas.conf of every broker is identical, with the 
> following content:
>  
>  
> {code:java}
> KafkaServer {
>   org.apache.kafka.common.security.plain.PlainLoginModule required
>   username="admin"
>   password="kJTVDziatPgjXG82sFHc4O1EIuewmlvS"
>   user_admin="kJTVDziatPgjXG82sFHc4O1EIuewmlvS"
>   user_alice="alice";
>   org.apache.kafka.common.security.scram.ScramLoginModule required
>   username="admin_scram"
>   password="admin_scram_password";
>  
> };
> {code}
>  
>  
> h3. broker server.properties:
> One broker's configuration file is shown below; the other brokers' files 
> differ only in the localPublicIp used in advertised.listeners.
>  
> {code:java}
> broker.id=1
> broker.rack=us-east-1a
> advertised.listeners=SASL_PLAINTEXT://localPublicIp:9779,SASL_SSL://localPublicIp:9889,INTERNAL_SSL://:9009,PLAIN_PLUGIN_SSL://localPublicIp:9669
> log.dirs=/asyncmq/kafka/data_1,/asyncmq/kafka/data_2
> zookeeper.connect=***
> listeners=SASL_PLAINTEXT://:9779,SASL_SSL://:9889,INTERNAL_SSL://:9009,PLAIN_PLUGIN_SSL://:9669
> listener.security.protocol.map=INTERNAL_SSL:SASL_SSL,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL,PLAIN_PLUGIN_SSL:SASL_SSL
> listener.name.plain_plugin_ssl.plain.sasl.server.callback.handler.class=org.apache.kafka.common.security.plain.internals.PlainServerCallbackHandler
> #ssl config
> ssl.keystore.password=***
> ssl.key.password=***
> ssl.truststore.password=***
> ssl.keystore.location=***
> ssl.truststore.location=***
> ssl.client.auth=none
> ssl.endpoint.identification.algorithm=
> #broker communicate config
> #security.inter.broker.protocol=SASL_PLAINTEXT
> inter.broker.listener.name=INTERNAL_SSL
> sasl.mechanism.inter.broker.protocol=PLAIN
> #sasl authentication config
> sasl.kerberos.service.name=kafka
> sasl.enabled.mechanisms=PLAIN,SCRAM-SHA-256,SCRAM-SHA-512,GSSAPI
> delegation.token.master.key=***
> delegation.token.expiry.time.ms=8640
> delegation.token.max.lifetime.ms=31536
> {code}
>  
>  
> Then all brokers are started at the same time. Each broker actually starts 
> successfully, but authentication consistently fails when the controller node 
> connects to the other brokers. Inter-broker connections cannot be established, 
> so the entire Kafka cluster is unable to serve external requests.
> h3. The server log keeps printing errors:
> The brokers' real IPs in the log are sensitive, so ** is used in their 
> place here
>  
> {code:java}
> [2021-10-29 14:16:19,831] INFO [SocketServer listenerType=ZK_BROKER, 
> nodeId=3] Started socket server acceptors and processors 
> (kafka.network.SocketServer)
> [2021-10-29 14:16:19,836] INFO Kafka version: 3.0.0 
> 

[jira] [Commented] (KAFKA-13422) Even if the correct username and password are configured, when ClientBroker or KafkaClient tries to establish a SASL connection to ServerBroker, an exception is thrown

2021-11-16 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17444396#comment-17444396
 ] 

Rajini Sivaram commented on KAFKA-13422:


[~RivenSun] The recommended way of configuring SASL for brokers is by 
specifying `listener.name.<listener>.<mechanism>.sasl.jaas.config` in 
the broker's server.properties file rather than using a separate JAAS config 
file. Using multiple login modules for KafkaServer works in some cases, but as 
you mentioned, that doesn't work when multiple mechanisms with conflicting 
options are specified for the same listener. That is a known issue that wasn't 
fixed because we have an alternative that works better with other Kafka 
features like password protection. With the broker property `sasl.jaas.config`, 
you specify the config separately for each listener and SASL mechanism and that 
avoids these types of conflicts.
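For illustration, a hedged sketch of what that looks like in server.properties for the INTERNAL_SSL listener above (credentials are placeholders); each mechanism on the listener gets its own isolated JAAS entry:

{code:java}
listener.name.internal_ssl.plain.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="admin" password="admin-secret" user_admin="admin-secret" user_alice="alice-secret";
listener.name.internal_ssl.scram-sha-256.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="admin_scram" password="admin_scram_secret";
{code}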

> Even if the correct username and password are configured, when ClientBroker 
> or KafkaClient tries to establish a SASL connection to ServerBroker, an 
> exception is thrown: (Authentication failed: Invalid username or password)
> --
>
> Key: KAFKA-13422
> URL: https://issues.apache.org/jira/browse/KAFKA-13422
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, core
>Affects Versions: 2.7.1, 3.0.0
>Reporter: RivenSun
>Priority: Major
> Attachments: CustomerAuthCallbackHandler.java, 
> LoginContext_login_debug.png, SaslClientCallbackHandler_handle_debug.png
>
>
>  
> h1. Foreword:
> While deploying a Kafka cluster on a newer version (2.7.1), I encountered 
> authentication failures on inter-broker connections. The problem can also be 
> reproduced on the current latest version, 3.0.0.
> h1. Problem reproduction:
> h2. 1) Broker version is 3.0.0
> h3. The kafka_server_jaas.conf of every broker is identical, with the 
> following content:
>  
>  
> {code:java}
> KafkaServer {
>   org.apache.kafka.common.security.plain.PlainLoginModule required
>   username="admin"
>   password="kJTVDziatPgjXG82sFHc4O1EIuewmlvS"
>   user_admin="kJTVDziatPgjXG82sFHc4O1EIuewmlvS"
>   user_alice="alice";
>   org.apache.kafka.common.security.scram.ScramLoginModule required
>   username="admin_scram"
>   password="admin_scram_password";
>  
> };
> {code}
>  
>  
> h3. broker server.properties:
> One broker's configuration file is shown below; the other brokers' files 
> differ only in the localPublicIp used in advertised.listeners.
>  
> {code:java}
> broker.id=1
> broker.rack=us-east-1a
> advertised.listeners=SASL_PLAINTEXT://localPublicIp:9779,SASL_SSL://localPublicIp:9889,INTERNAL_SSL://:9009,PLAIN_PLUGIN_SSL://localPublicIp:9669
> log.dirs=/asyncmq/kafka/data_1,/asyncmq/kafka/data_2
> zookeeper.connect=***
> listeners=SASL_PLAINTEXT://:9779,SASL_SSL://:9889,INTERNAL_SSL://:9009,PLAIN_PLUGIN_SSL://:9669
> listener.security.protocol.map=INTERNAL_SSL:SASL_SSL,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL,PLAIN_PLUGIN_SSL:SASL_SSL
> listener.name.plain_plugin_ssl.plain.sasl.server.callback.handler.class=org.apache.kafka.common.security.plain.internals.PlainServerCallbackHandler
> #ssl config
> ssl.keystore.password=***
> ssl.key.password=***
> ssl.truststore.password=***
> ssl.keystore.location=***
> ssl.truststore.location=***
> ssl.client.auth=none
> ssl.endpoint.identification.algorithm=
> #broker communicate config
> #security.inter.broker.protocol=SASL_PLAINTEXT
> inter.broker.listener.name=INTERNAL_SSL
> sasl.mechanism.inter.broker.protocol=PLAIN
> #sasl authentication config
> sasl.kerberos.service.name=kafka
> sasl.enabled.mechanisms=PLAIN,SCRAM-SHA-256,SCRAM-SHA-512,GSSAPI
> delegation.token.master.key=***
> delegation.token.expiry.time.ms=8640
> delegation.token.max.lifetime.ms=31536
> {code}
>  
>  
> Then all brokers are started at the same time. Each broker actually starts 
> successfully, but authentication consistently fails when the controller node 
> connects to the other brokers. Inter-broker connections cannot be established, 
> so the entire Kafka cluster is unable to serve external requests.
> h3. The server log keeps printing errors:
> The brokers' real IPs in the log are sensitive, so ** is used in their 
> place here
>  
> {code:java}
> [2021-10-29 14:16:19,831] INFO [SocketServer listenerType=ZK_BROKER, 
> nodeId=3] Started socket server acceptors and processors 
> (kafka.network.SocketServer)
> [2021-10-29 14:16:19,836] INFO Kafka version: 

[jira] [Commented] (KAFKA-13293) Support client reload of PEM certificates

2021-09-13 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414263#comment-17414263
 ] 

Rajini Sivaram commented on KAFKA-13293:


[~teabot] Sorry, I should have said `recreates SSLContext` rather than 
`SslEngineFactory`. I haven't tried it out, but I think you can implement a 
custom factory that has a mutable SSLContext which is updated when required. 
Basically we want to make sure `CustomSslEngineFactory` always has a valid 
`SSLContext` that is used by SslEngineFactory#createClientSslEngine() to create 
a new SSLEngine whenever a new connection is established, but we can swap out 
the SSLContext with a new one if we can detect a change. As you said, we cannot 
rely on `SslEngineFactory#shouldBeRebuilt()` since that is invoked only for 
brokers when ALTER_CONFIGS is processed.
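
For illustration, a minimal sketch of such a factory (all class, method and field 
names here are hypothetical; how the SSLContext is built and when refresh() is 
triggered are left to the application):

{code:java}
import java.security.KeyStore;
import java.security.NoSuchAlgorithmException;
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLParameters;

import org.apache.kafka.common.security.auth.SslEngineFactory;

public class CustomSslEngineFactory implements SslEngineFactory {

    // Swappable context: every new connection uses whichever context is current.
    private volatile SSLContext sslContext;

    @Override
    public void configure(Map<String, ?> configs) {
        this.sslContext = buildSslContext(configs);
    }

    // Called by application code when new certificates are detected.
    public void refresh(SSLContext newContext) {
        this.sslContext = newContext;
    }

    @Override
    public SSLEngine createClientSslEngine(String peerHost, int peerPort, String endpointIdentification) {
        SSLEngine engine = sslContext.createSSLEngine(peerHost, peerPort);
        engine.setUseClientMode(true);
        SSLParameters params = engine.getSSLParameters();
        params.setEndpointIdentificationAlgorithm(endpointIdentification);
        engine.setSSLParameters(params);
        return engine;
    }

    @Override
    public SSLEngine createServerSslEngine(String peerHost, int peerPort) {
        SSLEngine engine = sslContext.createSSLEngine(peerHost, peerPort);
        engine.setUseClientMode(false);
        return engine;
    }

    @Override
    public boolean shouldBeRebuilt(Map<String, Object> nextConfigs) {
        return false; // as noted above, only invoked for brokers
    }

    @Override
    public Set<String> reconfigurableConfigs() {
        return Collections.emptySet();
    }

    @Override
    public KeyStore keystore() {
        return null;
    }

    @Override
    public KeyStore truststore() {
        return null;
    }

    @Override
    public void close() {
    }

    private SSLContext buildSslContext(Map<String, ?> configs) {
        try {
            // Placeholder: a real implementation would load the PEM/JKS material
            // referenced by the configs (or an external source) and initialise the
            // context from it instead of using the JVM default.
            return SSLContext.getDefault();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }
}
{code}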

> Support client reload of PEM certificates
> -
>
> Key: KAFKA-13293
> URL: https://issues.apache.org/jira/browse/KAFKA-13293
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients, security
>Affects Versions: 2.7.0, 2.8.0, 2.7.1
>Reporter: Elliot West
>Priority: Major
>
> Since Kafka 2.7.0, clients are able to authenticate using PEM certificates as 
> client configuration properties in addition to JKS file based key stores 
> (KAFKA-10338). With PEM, certificate chains are passed into clients as simple 
> string based key-value properties, alongside existing client configuration. 
> This offers a number of benefits: it provides a JVM agnostic security 
> mechanism from the perspective of clients, removes the client's dependency on 
> the local filesystem, and allows the encapsulation of the entire client 
> configuration into a single payload.
> However, the current client PEM implementation has a feature regression when 
> compared with the JKS implementation. With the JKS approach, clients would 
> automatically reload certificates when the key stores were modified on disk. 
> This enables a seamless approach for the replacement of certificates when 
> they are due to expire; no further configuration or explicit interference 
> with the client lifecycle is needed for the client to migrate to renewed 
> certificates.
> Such a capability does not currently exist for PEM. One supplies key chains 
> when instantiating clients only - there is no mechanism available to either 
> directly reconfigure the client, or for the client to observe changes to the 
> original properties set reference used in construction. Additionally, no 
> work-arounds are documented that might give users alternative strategies for 
> dealing with expiring certificates. Given that expiration and renewal of 
> certificates is an industry standard practice, it could be argued that the 
> current PEM client implementation is not fit for purpose.
> In summary, a mechanism should be provided such that clients can 
> automatically detect, load, and use updated PEM key chains from some non-file 
> based source (object ref, method invocation, listener, etc.)
> Finally, it is suggested that, in the short term, the Kafka documentation be 
> updated to describe any viable mechanism for updating client PEM certs 
> (perhaps closing existing client and then recreating?).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13293) Support client reload of PEM certificates

2021-09-13 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414217#comment-17414217
 ] 

Rajini Sivaram commented on KAFKA-13293:


[~teabot] Yes, agree that it will be useful to support certificate updates 
without restart for clients.

1) At the moment, as you said you will need to pull in new certs and restart 
clients if you are using Kafka without custom plugins.

2) If you are willing to add a custom plugin, you can add a custom 
implementation of `ssl.engine.factory.class` that recreates the 
SslEngineFactory when certs change by watching key/truststore files or pulling 
in PEM configs from an external source.

3) For the longer term, 
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-649%3A+Dynamic+Client+Configuration]
 will be useful to perform dynamic updates for clients similar to how we do 
dynamic broker config updates.
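
For reference, a plugin along the lines of option 2 above would be wired into a 
client through the standard SSL configs; the factory class name below is 
hypothetical and any extra keys it reads are defined by the plugin itself:

{code:java}
# Illustrative client configuration only.
security.protocol=SSL
ssl.engine.factory.class=com.example.ReloadingSslEngineFactory
# Additional plugin-specific keys (e.g. certificate location or reload interval)
# would be passed through the client properties and read by the custom factory.
{code}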

> Support client reload of PEM certificates
> -
>
> Key: KAFKA-13293
> URL: https://issues.apache.org/jira/browse/KAFKA-13293
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients, security
>Affects Versions: 2.7.0, 2.8.0, 2.7.1
>Reporter: Elliot West
>Priority: Major
>
> Since Kafka 2.7.0, clients are able to authenticate using PEM certificates as 
> client configuration properties in addition to JKS file based key stores 
> (KAFKA-10338). With PEM, certificate chains are passed into clients as simple 
> string based key-value properties, alongside existing client configuration. 
> This offers a number of benefits: it provides a JVM agnostic security 
> mechanism from the perspective of clients, removes the client's dependency on 
> the local filesystem, and allows the encapsulation of the entire client 
> configuration into a single payload.
> However, the current client PEM implementation has a feature regression when 
> compared with the JKS implementation. With the JKS approach, clients would 
> automatically reload certificates when the key stores were modified on disk. 
> This enables a seamless approach for the replacement of certificates when 
> they are due to expire; no further configuration or explicit interference 
> with the client lifecycle is needed for the client to migrate to renewed 
> certificates.
> Such a capability does not currently exist for PEM. One supplies key chains 
> when instantiating clients only - there is no mechanism available to either 
> directly reconfigure the client, or for the client to observe changes to the 
> original properties set reference used in construction. Additionally, no 
> work-arounds are documented that might give users alternative strategies for 
> dealing with expiring certificates. Given that expiration and renewal of 
> certificates is an industry standard practice, it could be argued that the 
> current PEM client implementation is not fit for purpose.
> In summary, a mechanism should be provided such that clients can 
> automatically detect, load, and use updated PEM key chains from some non-file 
> based source (object ref, method invocation, listener, etc.)
> Finally, it is suggested that, in the short term, the Kafka documentation be 
> updated to describe any viable mechanism for updating client PEM certs 
> (perhaps closing existing client and then recreating?).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12751) ISRs remain in in-flight state if proposed state is same as actual state

2021-09-13 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414212#comment-17414212
 ] 

Rajini Sivaram commented on KAFKA-12751:


The bug only happens in an unusual case where the proposed ISR is the same as 
the expected ISR and the chances of hitting that are very low. And it is an 
issue only if `inter.broker.protocol.version >= 2.7`. If it does occur, further 
ISR updates don't occur, so the broker will need to be restarted. Release plan 
for 2.8.1 is here: 
[https://cwiki.apache.org/confluence/display/KAFKA/Release+Plan+2.8.1,] work is 
under way and the release is expected in within the next few weeks.

> ISRs remain in in-flight state if proposed state is same as actual state
> 
>
> Key: KAFKA-12751
> URL: https://issues.apache.org/jira/browse/KAFKA-12751
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.7.0, 2.8.0, 2.7.1
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Critical
> Fix For: 2.7.2, 2.8.1
>
>
> If proposed ISR state in an AlterIsr request is the same as the actual state, 
> Controller returns a successful response without performing any updates. But 
> the broker code that processes the response leaves the ISR state in in-flight 
> state without committing. This prevents further ISR updates until the next 
> leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13293) Support client reload of PEM certificates

2021-09-13 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414137#comment-17414137
 ] 

Rajini Sivaram commented on KAFKA-13293:


[~teabot] Which client (and configs) do you use to dynamically reload JKS 
keystores in clients today? As far as I understand, default implementation in 
Apache Kafka supports reloading of keystores and truststores from files only 
for brokers. Clients can override `ssl.engine.factory.class`, but not sure if 
that is what you were referring to for dynamic reloading in clients.

> Support client reload of PEM certificates
> -
>
> Key: KAFKA-13293
> URL: https://issues.apache.org/jira/browse/KAFKA-13293
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients, security
>Affects Versions: 2.7.0, 2.8.0, 2.7.1
>Reporter: Elliot West
>Priority: Major
>
> Since Kafka 2.7.0, clients are able to authenticate using PEM certificates as 
> client configuration properties in addition to JKS file based key stores 
> (KAFKA-10338). With PEM, certificate chains are passed into clients as simple 
> string based key-value properties, alongside existing client configuration. 
> This offers a number of benefits: it provides a JVM agnostic security 
> mechanism from the perspective of clients, removes the client's dependency on 
> the local filesystem, and allows the encapsulation of the entire client 
> configuration into a single payload.
> However, the current client PEM implementation has a feature regression when 
> compared with the JKS implementation. With the JKS approach, clients would 
> automatically reload certificates when the key stores were modified on disk. 
> This enables a seamless approach for the replacement of certificates when 
> they are due to expire; no further configuration or explicit interference 
> with the client lifecycle is needed for the client to migrate to renewed 
> certificates.
> Such a capability does not currently exist for PEM. One supplies key chains 
> when instantiating clients only - there is no mechanism available to either 
> directly reconfigure the client, or for the client to observe changes to the 
> original properties set reference used in construction. Additionally, no 
> work-arounds are documented that might give users alternative strategies for 
> dealing with expiring certificates. Given that expiration and renewal of 
> certificates is an industry standard practice, it could be argued that the 
> current PEM client implementation is not fit for purpose.
> In summary, a mechanism should be provided such that clients can 
> automatically detect, load, and use updated PEM key chains from some non-file 
> based source (object ref, method invocation, listener, etc.)
> Finally, it is suggested that, in the short term, the Kafka documentation be 
> updated to describe any viable mechanism for updating client PEM certs 
> (perhaps closing existing client and then recreating?).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KAFKA-10338) Support PEM format for SSL certificates and private key

2021-09-10 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-10338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413395#comment-17413395
 ] 

Rajini Sivaram edited comment on KAFKA-10338 at 9/10/21, 9:27 PM:
--

[~teabot] We currently don't have a way of reconfiguring PEM configs for 
clients unless they are stored externally in a file and the file is reloaded. 
It may be possible to add a custom `ssl.engine.factory.class` that does 
reconfiguration for clients. For brokers, we can use standard dynamic broker 
configs for PEM.


was (Author: rsivaram):
[~teabot] We currently don't have a way of updating PEM configs for clients 
unless they are stored externally in a file and the file is reloaded. It may be 
possible to add a custom `ssl.engine.factory.class` that does reconfiguration 
for clients. For brokers, we can use standard dynamic broker configs for PEM.

> Support PEM format for SSL certificates and private key
> ---
>
> Key: KAFKA-10338
> URL: https://issues.apache.org/jira/browse/KAFKA-10338
> Project: Kafka
>  Issue Type: New Feature
>  Components: security
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 2.7.0
>
>
> We currently support only file-based JKS/PKCS12 format for SSL key stores and 
> trust stores. It will be good to add support for PEM as configuration values 
> that fits better with config externalization.
> KIP: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-651+-+Support+PEM+format+for+SSL+certificates+and+private+key



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-10338) Support PEM format for SSL certificates and private key

2021-09-10 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-10338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413395#comment-17413395
 ] 

Rajini Sivaram commented on KAFKA-10338:


[~teabot] We currently don't have a way of updating PEM configs for clients 
unless they are stored externally in a file and the file is reloaded. It may be 
possible to add a custom `ssl.engine.factory.class` that does reconfiguration 
for clients. For brokers, we can use standard dynamic broker configs for PEM.
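
For context, these are the two forms referred to above (values and paths are 
illustrative only; inline PEM values in a properties file need the usual 
line-continuation escaping):

{code:java}
# PEM supplied inline as configuration values:
ssl.keystore.type=PEM
ssl.keystore.certificate.chain=-----BEGIN CERTIFICATE----- ... -----END CERTIFICATE-----
ssl.keystore.key=-----BEGIN PRIVATE KEY----- ... -----END PRIVATE KEY-----
ssl.truststore.type=PEM
ssl.truststore.certificates=-----BEGIN CERTIFICATE----- ... -----END CERTIFICATE-----

# PEM stored externally in files (the file-based form mentioned above):
# ssl.keystore.type=PEM
# ssl.keystore.location=/path/to/client-keystore.pem
# ssl.truststore.type=PEM
# ssl.truststore.location=/path/to/truststore.pem
{code}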

> Support PEM format for SSL certificates and private key
> ---
>
> Key: KAFKA-10338
> URL: https://issues.apache.org/jira/browse/KAFKA-10338
> Project: Kafka
>  Issue Type: New Feature
>  Components: security
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 2.7.0
>
>
> We currently support only file-based JKS/PKCS12 format for SSL key stores and 
> trust stores. It will be good to add support for PEM as configuration values 
> that fits better with config externalization.
> KIP: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-651+-+Support+PEM+format+for+SSL+certificates+and+private+key



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13277) Serialization of long tagged string in request/response throws BufferOverflowException

2021-09-08 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram updated KAFKA-13277:
---
Fix Version/s: 2.7.2

> Serialization of long tagged string in request/response throws 
> BufferOverflowException
> --
>
> Key: KAFKA-13277
> URL: https://issues.apache.org/jira/browse/KAFKA-13277
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Blocker
> Fix For: 3.0.0, 2.7.2, 2.8.1
>
>
> Size computation for tagged strings in the message generator is incorrect and 
> hence it works only for small strings (126 bytes or so) where the length 
> happens to be correct.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13277) Serialization of long tagged string in request/response throws BufferOverflowException

2021-09-08 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram updated KAFKA-13277:
---
Affects Version/s: 2.4.1
   2.5.1
   2.8.0
   2.7.1
   2.6.2

> Serialization of long tagged string in request/response throws 
> BufferOverflowException
> --
>
> Key: KAFKA-13277
> URL: https://issues.apache.org/jira/browse/KAFKA-13277
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 2.4.1, 2.5.1, 2.8.0, 2.7.1, 2.6.2
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Blocker
> Fix For: 3.0.0, 2.7.2, 2.8.1
>
>
> Size computation for tagged strings in the message generator is incorrect and 
> hence it works only for small strings (126 bytes or so) where the length 
> happens to be correct.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13277) Serialization of long tagged string in request/response throws BufferOverflowException

2021-09-08 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram updated KAFKA-13277:
---
Fix Version/s: 2.8.1

> Serialization of long tagged string in request/response throws 
> BufferOverflowException
> --
>
> Key: KAFKA-13277
> URL: https://issues.apache.org/jira/browse/KAFKA-13277
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Blocker
> Fix For: 3.0.0, 2.8.1
>
>
> Size computation for tagged strings in the message generator is incorrect and 
> hence it works only for small strings (126 bytes or so) where the length 
> happens to be correct.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13277) Serialization of long tagged string in request/response throws BufferOverflowException

2021-09-07 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram updated KAFKA-13277:
---
Fix Version/s: (was: 3.0.1)
   3.0.0

> Serialization of long tagged string in request/response throws 
> BufferOverflowException
> --
>
> Key: KAFKA-13277
> URL: https://issues.apache.org/jira/browse/KAFKA-13277
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 3.0.0
>
>
> Size computation for tagged strings in the message generator is incorrect and 
> hence it works only for small strings (126 bytes or so) where the length 
> happens to be correct.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13277) Serialization of long tagged string in request/response throws BufferOverflowException

2021-09-07 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram updated KAFKA-13277:
---
Priority: Blocker  (was: Major)

> Serialization of long tagged string in request/response throws 
> BufferOverflowException
> --
>
> Key: KAFKA-13277
> URL: https://issues.apache.org/jira/browse/KAFKA-13277
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Blocker
> Fix For: 3.0.0
>
>
> Size computation for tagged strings in the message generator is incorrect and 
> hence it works only for small strings (126 bytes or so) where the length 
> happens to be correct.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13277) Serialization of long tagged string in request/response throws BufferOverflowException

2021-09-07 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411311#comment-17411311
 ] 

Rajini Sivaram commented on KAFKA-13277:


[~ijuma] Yes, it will be good to include in 3.0.

> Serialization of long tagged string in request/response throws 
> BufferOverflowException
> --
>
> Key: KAFKA-13277
> URL: https://issues.apache.org/jira/browse/KAFKA-13277
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 3.0.1
>
>
> Size computation for tagged strings in the message generator is incorrect and 
> hence it works only for small strings (126 bytes or so) where the length 
> happens to be correct.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-13277) Serialization of long tagged string in request/response throws BufferOverflowException

2021-09-07 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-13277:
--

 Summary: Serialization of long tagged string in request/response 
throws BufferOverflowException
 Key: KAFKA-13277
 URL: https://issues.apache.org/jira/browse/KAFKA-13277
 Project: Kafka
  Issue Type: Bug
  Components: clients
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram
 Fix For: 3.0.1


Size computation for tagged strings in the message generator is incorrect and 
hence it works only for small strings (126 bytes or so) where the length 
happens to be correct.
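
For background, tagged field sizes are encoded as unsigned varints, which is why an 
incorrect size computation can still appear to work while the size fits in a single 
varint byte (values below 128) and only break for longer strings. A small 
illustration (this is not the generator code; Kafka has its own ByteUtils helpers):

{code:java}
public final class VarintSizeDemo {

    // Number of bytes needed to encode value as an unsigned varint (7 bits per byte).
    static int sizeOfUnsignedVarint(int value) {
        int bytes = 1;
        long v = value & 0xffffffffL;
        while ((v & ~0x7FL) != 0L) { // more than 7 bits remain
            bytes++;
            v >>>= 7;
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(sizeOfUnsignedVarint(126)); // 1
        System.out.println(sizeOfUnsignedVarint(127)); // 1
        System.out.println(sizeOfUnsignedVarint(128)); // 2 -- a one-byte assumption breaks here
    }
}
{code}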



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-10774) Support Describe topic using topic IDs

2021-08-28 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-10774.

Fix Version/s: 3.1.0
 Reviewer: Rajini Sivaram
   Resolution: Fixed

> Support Describe topic using topic IDs
> --
>
> Key: KAFKA-10774
> URL: https://issues.apache.org/jira/browse/KAFKA-10774
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: dengziming
>Assignee: dengziming
>Priority: Major
> Fix For: 3.1.0
>
>
> Similar to KAFKA-10547 which add topic IDs in MetadataResp, we add topic IDs 
> to MetadataReq and can get TopicDesc by topic IDs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-13207) Replica fetcher should not update partition state on diverging epoch if partition removed from fetcher

2021-08-17 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-13207.

Resolution: Fixed

> Replica fetcher should not update partition state on diverging epoch if 
> partition removed from fetcher
> --
>
> Key: KAFKA-13207
> URL: https://issues.apache.org/jira/browse/KAFKA-13207
> Project: Kafka
>  Issue Type: New Feature
>  Components: core
>Affects Versions: 2.8.0
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Critical
> Fix For: 3.0.0, 2.8.1
>
>
> {{AbstractFetcherThread#truncateOnFetchResponse}} is used with IBP 2.7 and 
> above to truncate partitions based on the diverging epoch returned in fetch 
> responses. Truncation should only be performed for partitions that are still 
> owned by the fetcher, and this check should be done while holding 
> {{partitionMapLock}} to ensure that any partitions removed from the fetcher 
> thread are not truncated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-13207) Replica fetcher should not update partition state on diverging epoch if partition removed from fetcher

2021-08-16 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-13207:
--

 Summary: Replica fetcher should not update partition state on 
diverging epoch if partition removed from fetcher
 Key: KAFKA-13207
 URL: https://issues.apache.org/jira/browse/KAFKA-13207
 Project: Kafka
  Issue Type: New Feature
  Components: core
Affects Versions: 2.8.0
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram
 Fix For: 3.0.0, 2.8.1


{{AbstractFetcherThread#truncateOnFetchResponse}} is used with IBP 2.7 and above 
to truncate partitions based on the diverging epoch returned in fetch responses. 
Truncation should only be performed for partitions that are still owned by the 
fetcher, and this check should be done while holding {{partitionMapLock}} to 
ensure that any partitions removed from the fetcher thread are not truncated.
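
As an illustration of the intended pattern (this is not the actual Scala code in 
AbstractFetcherThread; the class and field names below are placeholders), the 
ownership check and the truncation must happen under the same lock:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class TruncationSketch {

    private final ReentrantLock partitionMapLock = new ReentrantLock();
    private final Map<String, Long> partitionStates = new ConcurrentHashMap<>();

    void maybeTruncate(String topicPartition, long divergingOffset) {
        partitionMapLock.lock();
        try {
            // Truncate only if the partition is still owned by this fetcher; doing the
            // check outside the lock would allow a partition removed from the fetcher
            // in between to be truncated anyway.
            if (partitionStates.containsKey(topicPartition)) {
                truncateTo(topicPartition, divergingOffset);
            }
        } finally {
            partitionMapLock.unlock();
        }
    }

    private void truncateTo(String topicPartition, long offset) {
        // Placeholder for the actual log truncation.
        partitionStates.put(topicPartition, offset);
    }
}
{code}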



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-13141) Leader should not update follower fetch offset if diverging epoch is present

2021-07-28 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-13141.

  Reviewer: Jason Gustafson
Resolution: Fixed

> Leader should not update follower fetch offset if diverging epoch is present
> 
>
> Key: KAFKA-13141
> URL: https://issues.apache.org/jira/browse/KAFKA-13141
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.8.0, 2.7.1
>Reporter: Jason Gustafson
>Assignee: Rajini Sivaram
>Priority: Blocker
> Fix For: 3.0.0, 2.7.2, 2.8.1
>
>
> In 2.7, we began doing fetcher truncation piggybacked on the Fetch protocol 
> instead of using the old OffsetsForLeaderEpoch API. When truncation is 
> detected, we return a `divergingEpoch` field in the Fetch response, but we do 
> not set an error code. The sender is expected to check if the diverging epoch 
> is present and truncate accordingly.
> All of this works correctly in the fetcher implementation, but the problem is 
> that the logic to update the follower fetch position on the leader does not 
> take into account the diverging epoch present in the response. This means the 
> fetch offsets can be updated incorrectly, which can lead to either log 
> divergence or the loss of committed data.
> For example, we hit the following case with 3 replicas. Leader 1 is elected 
> in epoch 1 with an end offset of 100. The followers are at offset 101
> Broker 1: (Leader) Epoch 1 from offset 100
> Broker 2: (Follower) Epoch 1 from offset 101
> Broker 3: (Follower) Epoch 1 from offset 101
> Broker 1 receives fetches from 2 and 3 at offset 101. The leader detects the 
> divergence and returns a diverging epoch in the fetch state. Nevertheless, 
> the fetch positions for both followers are updated to 101 and the high 
> watermark is advanced.
> After brokers 2 and 3 had truncated to offset 100, broker 1 experienced a 
> network partition of some kind and was kicked from the ISR. This caused 
> broker 2 to get elected, which resulted in the following state at the start 
> of epoch 2.
> Broker 1: (Follower) Epoch 2 from offset 101
> Broker 2: (Leader) Epoch 2 from offset 100
> Broker 3: (Follower) Epoch 2 from offset 100
> Broker 2 was then able to write a new entry at offset 100 and the old record 
> which may have been exposed to consumers was deleted by broker 1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-13026) Idempotent producer (KAFKA-10619) follow-up testings

2021-07-26 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-13026.

Fix Version/s: 3.0.0
 Reviewer: Rajini Sivaram
   Resolution: Fixed

> Idempotent producer (KAFKA-10619) follow-up testings
> 
>
> Key: KAFKA-13026
> URL: https://issues.apache.org/jira/browse/KAFKA-13026
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Cheng Tan
>Assignee: Cheng Tan
>Priority: Major
> Fix For: 3.0.0
>
>
> # Adjust config priority
>  # Adjust the JUnit tests so we get good coverage of the non-default behavior
>  # Similar to 2 for system tests



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-13026) Idempotent producer (KAFKA-10619) follow-up testings

2021-07-26 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram reassigned KAFKA-13026:
--

Assignee: Cheng Tan

> Idempotent producer (KAFKA-10619) follow-up testings
> 
>
> Key: KAFKA-13026
> URL: https://issues.apache.org/jira/browse/KAFKA-13026
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Cheng Tan
>Assignee: Cheng Tan
>Priority: Major
>
> # Adjust config priority
>  # Adjust the JUnit tests so we get good coverage of the non-default behavior
>  # Similar to 2 for system tests



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-13043) Add Admin API for batched offset fetch requests (KIP-709)

2021-07-07 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-13043:
--

 Summary: Add Admin API for batched offset fetch requests (KIP-709)
 Key: KAFKA-13043
 URL: https://issues.apache.org/jira/browse/KAFKA-13043
 Project: Kafka
  Issue Type: New Feature
  Components: admin
Affects Versions: 3.0.0
Reporter: Rajini Sivaram
Assignee: Sanjana Kaundinya
 Fix For: 3.1.0


Protocol changes and broker-side changes to process batched OffsetFetchRequests 
were added under KAFKA-12234. This ticket is to add Admin API changes to use 
this feature.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12234) Extend OffsetFetch requests to accept multiple group ids.

2021-07-07 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-12234.

  Reviewer: Rajini Sivaram
Resolution: Fixed

> Extend OffsetFetch requests to accept multiple group ids.
> -
>
> Key: KAFKA-12234
> URL: https://issues.apache.org/jira/browse/KAFKA-12234
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 2.8.0
>Reporter: Tom Scott
>Assignee: Sanjana Kaundinya
>Priority: Minor
>  Labels: needs-kip
> Fix For: 3.0.0
>
>
> More details are in the KIP: 
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=173084258



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-13029) FindCoordinators batching can break consumers during rolling upgrade

2021-07-07 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-13029.

Resolution: Fixed

> FindCoordinators batching can break consumers during rolling upgrade
> 
>
> Key: KAFKA-13029
> URL: https://issues.apache.org/jira/browse/KAFKA-13029
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Blocker
> Fix For: 3.0.0
>
>
> The changes made under KIP-699 assume that it is always safe to use unbatched 
> mode since we move from batched to unbatched and cache the value forever in 
> clients if a broker doesn't support batching. During rolling upgrade, if a 
> request is sent to an older broker, we move from batched to unbatched mode. 
> The consumer (admin client as well I think) disables batching and future 
> requests to upgraded brokers will fail because we attempt to use unbatched 
> requests with a newer version of the request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13029) FindCoordinators batching can break consumers during rolling upgrade

2021-07-03 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17373991#comment-17373991
 ] 

Rajini Sivaram commented on KAFKA-13029:


[~mimaison] [~dajac] The consumer and transaction coordinator always have a single 
group, so the batch flag and failover handling in those are unnecessary. As David 
mentioned, we just need to set the appropriate data based on versions. The plan 
is to do the same for batching OffsetFetchRequest as well. For Admin Client, 
the PR currently switches to unbatched mode if there is an error. But the 
unbatched mode now works with any version (by setting appropriate data). So 
while performance could be improved, in case an admin client gets into this 
state, by using the versions of individual brokers, it didn't seem worthwhile making 
that change now for admin clients.

> FindCoordinators batching can break consumers during rolling upgrade
> 
>
> Key: KAFKA-13029
> URL: https://issues.apache.org/jira/browse/KAFKA-13029
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Blocker
> Fix For: 3.0.0
>
>
> The changes made under KIP-699 assume that it is always safe to use unbatched 
> mode since we move from batched to unbatched and cache the value forever in 
> clients if a broker doesn't support batching. During rolling upgrade, if a 
> request is sent to an older broker, we move from batched to unbatched mode. 
> The consumer (admin client as well I think) disables batching and future 
> requests to upgraded brokers will fail because we attempt to use unbatched 
> requests with a newer version of the request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-13029) FindCoordinators batching can break consumers during rolling upgrade

2021-07-02 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-13029:
--

 Summary: FindCoordinators batching can break consumers during 
rolling upgrade
 Key: KAFKA-13029
 URL: https://issues.apache.org/jira/browse/KAFKA-13029
 Project: Kafka
  Issue Type: Bug
  Components: consumer
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram
 Fix For: 3.0.0


The changes made under KIP-699 assume that it is always safe to use unbatched 
mode since we move from batched to unbatched and cache the value forever in 
clients if a broker doesn't support batching. During rolling upgrade, if a 
request is sent to an older broker, we move from batched to unbatched mode. The 
consumer (admin client as well I think) disables batching and future requests 
to upgraded brokers will fail because we attempt to use unbatched requests with 
a newer version of the request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12996) OffsetOutOfRange not handled correctly for diverging epochs when fetch offset less than leader start offset

2021-06-29 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram updated KAFKA-12996:
---
Fix Version/s: 2.7.2
Affects Version/s: 2.7.1

> OffsetOutOfRange not handled correctly for diverging epochs when fetch offset 
> less than leader start offset
> ---
>
> Key: KAFKA-12996
> URL: https://issues.apache.org/jira/browse/KAFKA-12996
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.8.0, 2.7.1
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 3.0.0, 2.7.2, 2.8.1
>
>
> If fetchOffset < startOffset, we currently throw OffsetOutOfRangeException 
> when attempting to read from the log in the regular case. But for diverging 
> epochs, we return Errors.NONE with the new leader start offset, hwm etc. 
> ReplicaFetcherThread throws OffsetOutOfRangeException when processing responses 
> with Errors.NONE if the leader's offsets in the response are out of range, and 
> this moves the partition to the failed state. We should add a check for this 
> case when processing fetch requests and ensure OffsetOutOfRangeException is 
> thrown regardless of epochs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12996) OffsetOutOfRange not handled correctly for diverging epochs when fetch offset less than leader start offset

2021-06-29 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-12996.

  Reviewer: Guozhang Wang
Resolution: Fixed

> OffsetOutOfRange not handled correctly for diverging epochs when fetch offset 
> less than leader start offset
> ---
>
> Key: KAFKA-12996
> URL: https://issues.apache.org/jira/browse/KAFKA-12996
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.8.0, 2.7.1
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 3.0.0, 2.7.2, 2.8.1
>
>
> If fetchOffset < startOffset, we currently throw OffsetOutOfRangeException 
> when attempting to read from the log in the regular case. But for diverging 
> epochs, we return Errors.NONE with the new leader start offset, hwm etc. 
> ReplicaFetcherThread throws OffsetOutOfRangeException when processing responses 
> with Errors.NONE if the leader's offsets in the response are out of range, and 
> this moves the partition to the failed state. We should add a check for this 
> case when processing fetch requests and ensure OffsetOutOfRangeException is 
> thrown regardless of epochs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12996) OffsetOutOfRange not handled correctly for diverging epochs when fetch offset less than leader start offset

2021-06-25 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-12996:
--

 Summary: OffsetOutOfRange not handled correctly for diverging 
epochs when fetch offset less than leader start offset
 Key: KAFKA-12996
 URL: https://issues.apache.org/jira/browse/KAFKA-12996
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 2.8.0
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram
 Fix For: 3.0.0, 2.8.1


If fetchOffset < startOffset, we currently throw OffsetOutOfRangeException when 
attempting to read from the log in the regular case. But for diverging epochs, we 
return Errors.NONE with the new leader start offset, hwm etc. 
ReplicaFetcherThread throws OffsetOutOfRangeException when processing responses 
with Errors.NONE if the leader's offsets in the response are out of range, and 
this moves the partition to the failed state. We should add a check for this case 
when processing fetch requests and ensure OffsetOutOfRangeException is thrown 
regardless of epochs.
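
A minimal sketch of the intended check (not the actual broker code; the class, 
method and parameter names are placeholders):

{code:java}
import org.apache.kafka.common.errors.OffsetOutOfRangeException;

public final class FetchOffsetCheck {

    // Raise the out-of-range error regardless of whether a diverging epoch would
    // otherwise be returned, so the follower handles it as a normal out-of-range
    // fetch instead of moving the partition to the failed state.
    public static void validate(long fetchOffset, long logStartOffset) {
        if (fetchOffset < logStartOffset) {
            throw new OffsetOutOfRangeException("Fetch offset " + fetchOffset
                + " is below the log start offset " + logStartOffset);
        }
    }
}
{code}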



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12948) NetworkClient.close(node) with node in connecting state makes NetworkClient unusable

2021-06-15 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363490#comment-17363490
 ] 

Rajini Sivaram commented on KAFKA-12948:


Yes, it was due to the changes from 
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-601%3A+Configurable+socket+connection+timeout+in+NetworkClient]
 , which went into 2.7.

> NetworkClient.close(node) with node in connecting state makes NetworkClient 
> unusable
> 
>
> Key: KAFKA-12948
> URL: https://issues.apache.org/jira/browse/KAFKA-12948
> Project: Kafka
>  Issue Type: Bug
>  Components: network
>Affects Versions: 2.8.0, 2.7.1
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Critical
> Fix For: 2.7.2, 2.8.1
>
>
> `NetworkClient.close(node)` closes the node and removes it from 
> `ClusterConnectionStates.nodeState`, but not from 
> `ClusterConnectionStates.connectingNodes`. Subsequent `NetworkClient.poll()` 
> invocations throw IllegalStateException and this leaves the NetworkClient in 
> an unusable state until the node is removed from connectingNodes or added to 
> nodeState. We don't use `NetworkClient.close(node)` in clients, but we use it 
> in clients started by brokers for replica fetcher and controller. Since 
> brokers use NetworkClientUtils.isReady() before establishing connections and 
> this invokes poll(), the NetworkClient never recovers.
> Exception stack trace:
> {code:java}
> java.lang.IllegalStateException: No entry found for connection 0
> at 
> org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:409)
> at 
> org.apache.kafka.clients.ClusterConnectionStates.isConnectionSetupTimeout(ClusterConnectionStates.java:446)
> at 
> org.apache.kafka.clients.ClusterConnectionStates.lambda$nodesWithConnectionSetupTimeout$0(ClusterConnectionStates.java:458)
> at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
> at 
> java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1553)
> at 
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
> at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
> at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> at 
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at 
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> at 
> org.apache.kafka.clients.ClusterConnectionStates.nodesWithConnectionSetupTimeout(ClusterConnectionStates.java:459)
> at 
> org.apache.kafka.clients.NetworkClient.handleTimedOutConnections(NetworkClient.java:807)
> at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:564)
> at 
> org.apache.kafka.clients.NetworkClientUtils.isReady(NetworkClientUtils.java:42)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12948) NetworkClient.close(node) with node in connecting state makes NetworkClient unusable

2021-06-15 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-12948.

Fix Version/s: 3.0.0
 Reviewer: David Jacot
   Resolution: Fixed

> NetworkClient.close(node) with node in connecting state makes NetworkClient 
> unusable
> 
>
> Key: KAFKA-12948
> URL: https://issues.apache.org/jira/browse/KAFKA-12948
> Project: Kafka
>  Issue Type: Bug
>  Components: network
>Affects Versions: 2.8.0, 2.7.1
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Critical
> Fix For: 3.0.0, 2.7.2, 2.8.1
>
>
> `NetworkClient.close(node)` closes the node and removes it from 
> `ClusterConnectionStates.nodeState`, but not from 
> `ClusterConnectionStates.connectingNodes`. Subsequent `NetworkClient.poll()` 
> invocations throw IllegalStateException and this leaves the NetworkClient in 
> an unusable state until the node is removed from connectingNodes or added to 
> nodeState. We don't use `NetworkClient.close(node)` in clients, but we use it 
> in clients started by brokers for replica fetcher and controller. Since 
> brokers use NetworkClientUtils.isReady() before establishing connections and 
> this invokes poll(), the NetworkClient never recovers.
> Exception stack trace:
> {code:java}
> java.lang.IllegalStateException: No entry found for connection 0
> at 
> org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:409)
> at 
> org.apache.kafka.clients.ClusterConnectionStates.isConnectionSetupTimeout(ClusterConnectionStates.java:446)
> at 
> org.apache.kafka.clients.ClusterConnectionStates.lambda$nodesWithConnectionSetupTimeout$0(ClusterConnectionStates.java:458)
> at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
> at 
> java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1553)
> at 
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
> at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
> at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> at 
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> at 
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> at 
> org.apache.kafka.clients.ClusterConnectionStates.nodesWithConnectionSetupTimeout(ClusterConnectionStates.java:459)
> at 
> org.apache.kafka.clients.NetworkClient.handleTimedOutConnections(NetworkClient.java:807)
> at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:564)
> at 
> org.apache.kafka.clients.NetworkClientUtils.isReady(NetworkClientUtils.java:42)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12948) NetworkClient.close(node) with node in connecting state makes NetworkClient unusable

2021-06-14 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-12948:
--

 Summary: NetworkClient.close(node) with node in connecting state 
makes NetworkClient unusable
 Key: KAFKA-12948
 URL: https://issues.apache.org/jira/browse/KAFKA-12948
 Project: Kafka
  Issue Type: Bug
  Components: network
Affects Versions: 2.7.1, 2.8.0
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram
 Fix For: 2.7.2, 2.8.1


`NetworkClient.close(node)` closes the node and removes it from 
`ClusterConnectionStates.nodeState`, but not from 
`ClusterConnectionStates.connectingNodes`. Subsequent `NetworkClient.poll()` 
invocations throw IllegalStateException and this leaves the NetworkClient in an 
unusable state until the node is removed from connectingNodes or added to 
nodeState. We don't use `NetworkClient.close(node)` in clients, but we use it 
in clients started by brokers for replica fetcher and controller. Since brokers 
use NetworkClientUtils.isReady() before establishing connections and this 
invokes poll(), the NetworkClient never recovers.

Exception stack trace:
{code:java}
java.lang.IllegalStateException: No entry found for connection 0
at 
org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:409)
at 
org.apache.kafka.clients.ClusterConnectionStates.isConnectionSetupTimeout(ClusterConnectionStates.java:446)
at 
org.apache.kafka.clients.ClusterConnectionStates.lambda$nodesWithConnectionSetupTimeout$0(ClusterConnectionStates.java:458)
at 
java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
at java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1553)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at 
java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at 
java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at 
org.apache.kafka.clients.ClusterConnectionStates.nodesWithConnectionSetupTimeout(ClusterConnectionStates.java:459)
at 
org.apache.kafka.clients.NetworkClient.handleTimedOutConnections(NetworkClient.java:807)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:564)
at 
org.apache.kafka.clients.NetworkClientUtils.isReady(NetworkClientUtils.java:42)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12867) Trogdor ConsumeBenchWorker quits prematurely with maxMessages config

2021-06-02 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-12867.

Fix Version/s: 3.0.0
 Reviewer: Rajini Sivaram
   Resolution: Fixed

> Trogdor ConsumeBenchWorker quits prematurely with maxMessages config
> 
>
> Key: KAFKA-12867
> URL: https://issues.apache.org/jira/browse/KAFKA-12867
> Project: Kafka
>  Issue Type: Bug
>Reporter: Kowshik Prakasam
>Assignee: Kowshik Prakasam
>Priority: Major
> Fix For: 3.0.0
>
>
> The trogdor 
> [ConsumeBenchWorker|https://github.com/apache/kafka/commits/fc405d792de12a50956195827eaf57bbf6c9/trogdor/src/main/java/org/apache/kafka/trogdor/workload/ConsumeBenchWorker.java]
>  has a bug. If one of the consumption tasks completes executing successfully 
> due to [maxMessages being 
> consumed|https://github.com/apache/kafka/blob/fc405d792de12a50956195827eaf57bbf6c9/trogdor/src/main/java/org/apache/kafka/trogdor/workload/ConsumeBenchWorker.java#L245],
>  then, the consumption task [notifies the 
> doneFuture|https://github.com/apache/kafka/blob/fc405d792de12a50956195827eaf57bbf6c9/trogdor/src/main/java/org/apache/kafka/trogdor/workload/ConsumeBenchWorker.java#L285]
>  causing the ConsumeBenchWorker to halt. This becomes a problem when more 
> than 1 consumption task is running in parallel, because the successful 
> completion of 1 of the tasks shuts down the entire worker while the other 
> tasks are still running. When the worker is shut down, it 
> [kills|https://github.com/apache/kafka/blob/fc405d792de12a50956195827eaf57bbf6c9/trogdor/src/main/java/org/apache/kafka/trogdor/workload/ConsumeBenchWorker.java#L482]
>  all the active consumption tasks, which is not the desired behavior.
> The fix is to not notify the doneFuture when 1 of the consumption tasks 
> completes without error. Instead, we should defer the notification to the 
> [CloseStatusUpdater|https://github.com/apache/kafka/blob/fc405d792de12a50956195827eaf57bbf6c9/trogdor/src/main/java/org/apache/kafka/trogdor/workload/ConsumeBenchWorker.java#L299]
>  thread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-12867) Trogdor ConsumeBenchWorker quits prematurely with maxMessages config

2021-06-02 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram reassigned KAFKA-12867:
--

Assignee: Kowshik Prakasam

> Trogdor ConsumeBenchWorker quits prematurely with maxMessages config
> 
>
> Key: KAFKA-12867
> URL: https://issues.apache.org/jira/browse/KAFKA-12867
> Project: Kafka
>  Issue Type: Bug
>Reporter: Kowshik Prakasam
>Assignee: Kowshik Prakasam
>Priority: Major
>
> The trogdor 
> [ConsumeBenchWorker|https://github.com/apache/kafka/commits/fc405d792de12a50956195827eaf57bbf6c9/trogdor/src/main/java/org/apache/kafka/trogdor/workload/ConsumeBenchWorker.java]
>  has a bug. If one of the consumption tasks completes executing successfully 
> due to [maxMessages being 
> consumed|https://github.com/apache/kafka/blob/fc405d792de12a50956195827eaf57bbf6c9/trogdor/src/main/java/org/apache/kafka/trogdor/workload/ConsumeBenchWorker.java#L245],
>  then, the consumption task [notifies the 
> doneFuture|https://github.com/apache/kafka/blob/fc405d792de12a50956195827eaf57bbf6c9/trogdor/src/main/java/org/apache/kafka/trogdor/workload/ConsumeBenchWorker.java#L285]
>  causing the ConsumeBenchWorker to halt. This becomes a problem when more 
> than 1 consumption task is running in parallel, because the successful 
> completion of 1 of the tasks shuts down the entire worker while the other 
> tasks are still running. When the worker is shut down, it 
> [kills|https://github.com/apache/kafka/blob/fc405d792de12a50956195827eaf57bbf6c9/trogdor/src/main/java/org/apache/kafka/trogdor/workload/ConsumeBenchWorker.java#L482]
>  all the active consumption tasks, which is not the desired behavior.
> The fix is to not notify the doneFuture when 1 of the consumption tasks 
> completes without error. Instead, we should defer the notification to the 
> [CloseStatusUpdater|https://github.com/apache/kafka/blob/fc405d792de12a50956195827eaf57bbf6c9/trogdor/src/main/java/org/apache/kafka/trogdor/workload/ConsumeBenchWorker.java#L299]
>  thread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12751) ISRs remain in in-flight state if proposed state is same as actual state

2021-05-18 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-12751.

  Reviewer: David Arthur
Resolution: Fixed

> ISRs remain in in-flight state if proposed state is same as actual state
> 
>
> Key: KAFKA-12751
> URL: https://issues.apache.org/jira/browse/KAFKA-12751
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.7.0, 2.8.0, 2.7.1
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Critical
> Fix For: 2.7.2, 2.8.1
>
>
> If proposed ISR state in an AlterIsr request is the same as the actual state, 
> Controller returns a successful response without performing any updates. But 
> the broker code that processes the response leaves the ISR state in in-flight 
> state without committing. This prevents further ISR updates until the next 
> leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12730) A single Kerberos login failure fails all future connections from Java 9 onwards

2021-05-05 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram updated KAFKA-12730:
---
Fix Version/s: 2.8.1
   2.7.2
   2.6.3
   2.5.2

> A single Kerberos login failure fails all future connections from Java 9 
> onwards
> 
>
> Key: KAFKA-12730
> URL: https://issues.apache.org/jira/browse/KAFKA-12730
> Project: Kafka
>  Issue Type: Bug
>  Components: security
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 3.0.0, 2.5.2, 2.6.3, 2.7.2, 2.8.1
>
>
> The refresh thread for Kerberos performs re-login by logging out and then 
> logging in again. If login fails, we retry after a backoff. Every iteration 
> of the loop performs loginContext.logout() and loginContext.login(). If login 
> fails, we end up with two consecutive logouts. This used to work, but from 
> Java 9 onwards, this results in a NullPointerException due to 
> https://bugs.openjdk.java.net/browse/JDK-8173069. We should check if logout 
> is required before attempting logout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-10727) Kafka clients throw AuthenticationException during Kerberos re-login

2021-05-05 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram updated KAFKA-10727:
---
Fix Version/s: 2.7.2
   2.6.3
   2.5.2

> Kafka clients throw AuthenticationException during Kerberos re-login
> 
>
> Key: KAFKA-10727
> URL: https://issues.apache.org/jira/browse/KAFKA-10727
> Project: Kafka
>  Issue Type: Bug
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 2.5.2, 2.8.0, 2.6.3, 2.7.2
>
>
> During Kerberos re-login, we log out and login again. There is a timing issue 
> where the principal in the Subject has been cleared, but a new one hasn't 
> been populated yet. We need to ensure that we don't throw 
> AuthenticationException in this case to avoid Kafka clients 
> (consumer/producer etc.) failing instead of retrying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12751) ISRs remain in in-flight state if proposed state is same as actual state

2021-05-05 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-12751:
--

 Summary: ISRs remain in in-flight state if proposed state is same 
as actual state
 Key: KAFKA-12751
 URL: https://issues.apache.org/jira/browse/KAFKA-12751
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 2.8.0, 2.7.0, 2.7.1
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram
 Fix For: 2.7.2, 2.8.1


If proposed ISR state in an AlterIsr request is the same as the actual state, 
Controller returns a successful response without performing any updates. But 
the broker code that processes the response leaves the ISR state in in-flight 
state without committing. This prevents further ISR updates until the next 
leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12730) A single Kerberos login failure fails all future connections from Java 9 onwards

2021-04-29 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-12730.

  Reviewer: Manikumar
Resolution: Fixed

> A single Kerberos login failure fails all future connections from Java 9 
> onwards
> 
>
> Key: KAFKA-12730
> URL: https://issues.apache.org/jira/browse/KAFKA-12730
> Project: Kafka
>  Issue Type: Bug
>  Components: security
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 3.0.0
>
>
> The refresh thread for Kerberos performs re-login by logging out and then 
> logging in again. If login fails, we retry after a backoff. Every iteration 
> of the loop performs loginContext.logout() and loginContext.login(). If login 
> fails, we end up with two consecutive logouts. This used to work, but from 
> Java 9 onwards, this results in a NullPointerException due to 
> https://bugs.openjdk.java.net/browse/JDK-8173069. We should check if logout 
> is required before attempting logout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12730) A single Kerberos login failure fails all future connections from Java 9 onwards

2021-04-29 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-12730:
--

 Summary: A single Kerberos login failure fails all future 
connections from Java 9 onwards
 Key: KAFKA-12730
 URL: https://issues.apache.org/jira/browse/KAFKA-12730
 Project: Kafka
  Issue Type: Bug
  Components: security
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram
 Fix For: 3.0.0


The refresh thread for Kerberos performs re-login by logging out and then 
logging in again. If login fails, we retry after a backoff. Every iteration of 
the loop performs loginContext.logout() and loginContext.login(). If login 
fails, we end up with two consecutive logouts. This used to work, but from Java 
9 onwards, this results in a NullPointerException due to 
https://bugs.openjdk.java.net/browse/JDK-8173069. We should check if logout is 
required before attempting logout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12598) Remove deprecated --zookeeper in ConfigCommand

2021-04-13 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320471#comment-17320471
 ] 

Rajini Sivaram commented on KAFKA-12598:


[~cmccabe] [~hachikuji] Have we figured out how/when to remove ZK options for 
commands used to bootstrap brokers?

> Remove deprecated --zookeeper in ConfigCommand
> --
>
> Key: KAFKA-12598
> URL: https://issues.apache.org/jira/browse/KAFKA-12598
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Luke Chen
>Assignee: Luke Chen
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12479) Combine partition offset requests into single request in ConsumerGroupCommand

2021-03-23 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-12479.

Fix Version/s: 3.0.0
 Reviewer: David Jacot
   Resolution: Fixed

> Combine partition offset requests into single request in ConsumerGroupCommand
> -
>
> Key: KAFKA-12479
> URL: https://issues.apache.org/jira/browse/KAFKA-12479
> Project: Kafka
>  Issue Type: Improvement
>  Components: tools
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 3.0.0
>
>
> We currently send one request per partition to obtain offset information. For 
> a group with a large number of partitions, this can take several minutes. It 
> would be more efficient to send a single request containing all partitions.
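
For illustration, the public Admin API can already express this as one batched call; a sketch with placeholder bootstrap server, topic, and partitions:

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.common.TopicPartition;

public final class BatchedEndOffsets {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // One call describing every partition, instead of one request per partition.
            Map<TopicPartition, OffsetSpec> request = new HashMap<>();
            request.put(new TopicPartition("test-topic", 0), OffsetSpec.latest());
            request.put(new TopicPartition("test-topic", 1), OffsetSpec.latest());
            ListOffsetsResult result = admin.listOffsets(request);
            result.all().get().forEach((tp, info) ->
                System.out.println(tp + " -> " + info.offset()));
        }
    }
}
{code}

Internally the Admin client should group these partitions by leader, so the request count scales with the number of brokers rather than the number of partitions.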



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12268) System tests broken because consumer returns early without records

2021-03-22 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-12268.

Resolution: Fixed

> System tests broken because consumer returns early without records 
> ---
>
> Key: KAFKA-12268
> URL: https://issues.apache.org/jira/browse/KAFKA-12268
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Reporter: Rajini Sivaram
>Assignee: John Roesler
>Priority: Critical
> Fix For: 2.8.0
>
>
> https://issues.apache.org/jira/browse/KAFKA-10866 added metadata to 
> ConsumerRecords. We add metadata even when there are no records. As a result, 
> we sometimes return early from KafkaConsumer#poll() with no records because 
> FetchedRecords.isEmpty returns false if either metadata or records are 
> available. This breaks system tests which rely on poll timeout, expecting 
> records to be returned on every poll when there is no timeout.
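
The gist of the fix, as a hedged sketch; FetchedRecords below is a stand-in for the internal fetcher result, not the public consumer API. The early-return decision in poll() should depend only on whether records were fetched, not on the piggybacked metadata.

{code:java}
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.TopicPartition;

// Stand-in for the internal fetcher result consulted inside KafkaConsumer#poll().
final class FetchedRecords<K, V> {
    private final Map<TopicPartition, List<ConsumerRecord<K, V>>> records;
    private final Map<TopicPartition, Long> metadata; // e.g. position/end-offset updates

    FetchedRecords(Map<TopicPartition, List<ConsumerRecord<K, V>>> records,
                   Map<TopicPartition, Long> metadata) {
        this.records = records;
        this.metadata = metadata;
    }

    // Buggy behaviour: reports "not empty" when only metadata is present,
    // which makes poll() return early with zero records.
    boolean isEmptyIncludingMetadata() {
        return records.isEmpty() && metadata.isEmpty();
    }

    // Sketch of the intent: return from poll() early only when actual
    // records were fetched.
    boolean hasRecords() {
        return !records.isEmpty();
    }
}
{code}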



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-12268) System tests broken because consumer returns early without records

2021-03-22 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram reassigned KAFKA-12268:
--

Assignee: John Roesler  (was: Rajini Sivaram)

> System tests broken because consumer returns early without records 
> ---
>
> Key: KAFKA-12268
> URL: https://issues.apache.org/jira/browse/KAFKA-12268
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Reporter: Rajini Sivaram
>Assignee: John Roesler
>Priority: Critical
> Fix For: 2.8.0
>
>
> https://issues.apache.org/jira/browse/KAFKA-10866 added metadata to 
> ConsumerRecords. We add metadata even when there are no records. As a result, 
> we sometimes return early from KafkaConsumer#poll() with no records because 
> FetchedRecords.isEmpty returns false if either metadata or records are 
> available. This breaks system tests which rely on poll timeout, expecting 
> records to be returned on every poll when there is no timeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12330) FetchSessionCache may cause starvation for partitions when FetchResponse is full

2021-03-16 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-12330.

Fix Version/s: 3.0.0
   Resolution: Fixed

> FetchSessionCache may cause starvation for partitions when FetchResponse is 
> full
> 
>
> Key: KAFKA-12330
> URL: https://issues.apache.org/jira/browse/KAFKA-12330
> Project: Kafka
>  Issue Type: Bug
>Reporter: Lucas Bradstreet
>Assignee: David Jacot
>Priority: Major
> Fix For: 3.0.0
>
>
> The incremental FetchSessionCache sessions deprioritizes partitions where a 
> response is returned. This may happen if log metadata such as log start 
> offset, hwm, etc is returned, or if data for that partition is returned.
> When a fetch response fills to maxBytes, data may not be returned for 
> partitions even if the fetch offset is lower than the fetch upper bound. 
> However, the fetch response will still contain updates to metadata such as 
> hwm if that metadata has changed. This can lead to degenerate behavior where 
> a partition's hwm or log start offset is updated resulting in the next fetch 
> being unnecessarily skipped for that partition. At first this appeared to be 
> worse, as hwm updates occur frequently, but starvation should result in hwm 
> movement becoming blocked, allowing a fetch to go through and then becoming 
> unstuck. However, it'll still require one more fetch request than necessary 
> to do so. Consumers may be affected more than replica fetchers, however they 
> often remove partitions with fetched data from the next fetch request and 
> this may be helping prevent starvation.
> I believe we should only reorder the partition fetch priority if data is 
> actually returned for a partition.
> {noformat}
> private class PartitionIterator(val iter: FetchSession.RESP_MAP_ITER,
>                                 val updateFetchContextAndRemoveUnselected: Boolean)
>   extends FetchSession.RESP_MAP_ITER {
>
>   var nextElement: util.Map.Entry[TopicPartition, FetchResponse.PartitionData[Records]] = null
>
>   override def hasNext: Boolean = {
>     while ((nextElement == null) && iter.hasNext) {
>       val element = iter.next()
>       val topicPart = element.getKey
>       val respData = element.getValue
>       val cachedPart = session.partitionMap.find(new CachedPartition(topicPart))
>       val mustRespond = cachedPart.maybeUpdateResponseData(respData, updateFetchContextAndRemoveUnselected)
>       if (mustRespond) {
>         nextElement = element
>         // Example POC change:
>         // Don't move partition to end of queue if we didn't actually fetch data.
>         // This should help avoid starvation even when we are filling the fetch
>         // response fully while returning metadata for these partitions.
>         if (updateFetchContextAndRemoveUnselected && respData.records != null && respData.records.sizeInBytes > 0) {
>           session.partitionMap.remove(cachedPart)
>           session.partitionMap.mustAdd(cachedPart)
>         }
>       } else {
>         if (updateFetchContextAndRemoveUnselected) {
>           iter.remove()
>         }
>       }
>     }
>     nextElement != null
>   }{noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12427) Broker does not close muted idle connections with buffered data

2021-03-16 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-12427.

Fix Version/s: 3.0.0
 Reviewer: Rajini Sivaram
   Resolution: Fixed

> Broker does not close muted idle connections with buffered data
> ---
>
> Key: KAFKA-12427
> URL: https://issues.apache.org/jira/browse/KAFKA-12427
> Project: Kafka
>  Issue Type: Bug
>  Components: core, network
>Reporter: David Mao
>Assignee: David Mao
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12479) Combine partition offset requests into single request in ConsumerGroupCommand

2021-03-16 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-12479:
--

 Summary: Combine partition offset requests into single request in 
ConsumerGroupCommand
 Key: KAFKA-12479
 URL: https://issues.apache.org/jira/browse/KAFKA-12479
 Project: Kafka
  Issue Type: Improvement
  Components: tools
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram


We currently send one request per partition to obtain offset information. For a 
group with a large number of partitions, this can take several minutes. It 
would be more efficient to send a single request containing all partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12431) Fetch Request/Response without Topic information

2021-03-09 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297989#comment-17297989
 ] 

Rajini Sivaram commented on KAFKA-12431:


I can't think of any changes in 2.5.x or 2.6.x that explain this behaviour. 
Looking at the code, I think we don't handle arithmetic overflows, so a zero 
quota may happen to result in no quota enforcement because an overflow sets 
throttleTimeMs to a negative value. But this code doesn't seem to have 
changed much since the older versions. Empty fetch response for quota violation 
was introduced in 2.0.0 under KIP-219. In 2.6.0, we improved on this by sending 
partial responses under KAFKA-9677.  That may not help with very low quotas, 
but shouldn't have caused any degradation compared to before. In any case, this 
shouldn't have altered the handling of quota=0.
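
Purely as an illustration of the kind of overflow mentioned above (not the actual quota code), a zero quota can turn into "no throttling" through saturation and wrap-around:

{code:java}
public final class ZeroQuotaOverflow {
    public static void main(String[] args) {
        double observedRate = 1_000.0;
        double quota = 0.0;                       // quota configured as zero
        double ratio = observedRate / quota;      // Infinity
        long saturated = (long) ratio;            // saturates to Long.MAX_VALUE
        long throttleTimeMs = saturated * 1000L;  // overflows and wraps to -1000
        System.out.println(ratio + " " + saturated + " " + throttleTimeMs);
        // A negative throttle time would typically be treated as "no delay".
    }
}
{code}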

> Fetch Request/Response without Topic information
> 
>
> Key: KAFKA-12431
> URL: https://issues.apache.org/jira/browse/KAFKA-12431
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.6.1
>Reporter: Peter Sinoros-Szabo
>Priority: Major
> Attachments: fetch-on-2.4.1.png, fetch-on-2.6.1.png, 
> kafka-highcpu-24.svg.zip, kafka-highcpu-26.svg.zip
>
>
> I was running a 6 node Kafka 2.4.1 cluster with protocol and message format 
> version set to 2.4. I wanted to upgrade the cluster to 2.6.1 and after I 
> upgraded the 1st broker to 2.6.1 without any configuration change, I noticed 
> much higher CPU usage on that broker (instead of 25% CPU usage it was  ~350%) 
> and about 3-4x higher network traffic. So I dumped the traffic between the 
> Kafka client and broker and compared it with the traffic of the same broker 
> downgraded to 2.4.1.
> It seems to me that after I upgraded to 2.6.1, the Fetch requests and 
> responses are not complete, it is missing the topics part of the Fetch 
> Request, I don't know for what reason. I guess there should be always a 
> topics part.
> I'll attache a screenshot from these messages.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12254) MirrorMaker 2.0 creates destination topic with default configs

2021-03-02 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram updated KAFKA-12254:
---
Fix Version/s: (was: 3.0.0)

> MirrorMaker 2.0 creates destination topic with default configs
> --
>
> Key: KAFKA-12254
> URL: https://issues.apache.org/jira/browse/KAFKA-12254
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.4.0, 2.5.0, 2.4.1, 2.6.0, 2.5.1, 2.7.0, 2.6.1, 2.8.0
>Reporter: Dhruvil Shah
>Assignee: Dhruvil Shah
>Priority: Blocker
> Fix For: 2.8.0
>
>
> `MirrorSourceConnector` implements the logic for replicating data, 
> configurations, and other metadata between the source and destination 
> clusters. This includes the tasks below:
>  # `refreshTopicPartitions` for syncing topics / partitions from source to 
> destination.
>  # `syncTopicConfigs` for syncing topic configurations from source to 
> destination.
> A limitation is that `computeAndCreateTopicPartitions` creates topics with 
> default configurations on the destination cluster. A separate async task 
> `syncTopicConfigs` is responsible for syncing the topic configs. Before that 
> sync happens, topic configurations could be out of sync between the two 
> clusters.
> In the worst case, this could lead to data loss eg. when we have a compacted 
> topic being mirrored between clusters which is incorrectly created with the 
> default configuration of `cleanup.policy = delete` on the destination before 
> the configurations are sync'd via `syncTopicConfigs`.
> Here is an example of the divergence:
> Source Topic:
> ```
> Topic: foobar PartitionCount: 1 ReplicationFactor: 1 Configs: 
> cleanup.policy=compact,segment.bytes=1073741824
> ```
> Destination Topic:
> ```
> Topic: A.foobar PartitionCount: 1 ReplicationFactor: 1 Configs: 
> segment.bytes=1073741824
> ```
> A safer approach is to ensure that the right configurations are set on the 
> destination cluster before data is replicated to it.
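
A hedged sketch of the safer ordering using the public Admin API; the topic names, partition count, and the "copy only explicitly set entries" filter are placeholders, not MirrorMaker's actual code. The source topic's configs are read first and passed at creation time, so e.g. cleanup.policy=compact is never briefly "delete" on the destination.

{code:java}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.ConfigResource;

public final class CreateTopicWithSourceConfigs {
    static void mirrorTopic(Admin source, Admin target, String sourceTopic, String targetTopic,
                            int partitions) throws Exception {
        ConfigResource resource = new ConfigResource(ConfigResource.Type.TOPIC, sourceTopic);
        Config config = source.describeConfigs(Collections.singleton(resource))
            .all().get().get(resource);

        // Copy only explicitly set, writable, non-sensitive entries; this filter is an assumption.
        Map<String, String> targetConfigs = new HashMap<>();
        config.entries().forEach(e -> {
            if (!e.isDefault() && !e.isReadOnly() && !e.isSensitive()) {
                targetConfigs.put(e.name(), e.value());
            }
        });

        NewTopic newTopic = new NewTopic(targetTopic, Optional.of(partitions), Optional.empty())
            .configs(targetConfigs);
        target.createTopics(Collections.singleton(newTopic)).all().get();
    }
}
{code}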



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-10700) Support mutual TLS authentication for SASL_SSL listeners

2021-02-02 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-10700.

  Reviewer: David Jacot
Resolution: Fixed

> Support mutual TLS authentication for SASL_SSL listeners
> 
>
> Key: KAFKA-10700
> URL: https://issues.apache.org/jira/browse/KAFKA-10700
> Project: Kafka
>  Issue Type: New Feature
>  Components: security
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 2.8.0
>
>
> See 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-684+-+Support+mutual+TLS+authentication+on+SASL_SSL+listeners
>  for details



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12268) System tests broken because consumer returns early without records

2021-02-02 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-12268:
--

 Summary: System tests broken because consumer returns early 
without records 
 Key: KAFKA-12268
 URL: https://issues.apache.org/jira/browse/KAFKA-12268
 Project: Kafka
  Issue Type: Bug
  Components: consumer
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram
 Fix For: 2.8.0


https://issues.apache.org/jira/browse/KAFKA-10866 added metadata to 
ConsumerRecords. We add metadata even when there are no records. As a result, 
we sometimes return early from KafkaConsumer#poll() with no records because 
FetchedRecords.isEmpty returns false if either metadata or records are 
available. This breaks system tests which rely on poll timeout, expecting 
records to be returned on every poll when there is no timeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-10798) Failed authentication delay doesn't work with some SASL authentication failures

2020-12-16 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram updated KAFKA-10798:
---
Fix Version/s: 2.6.2
   2.7.1

> Failed authentication delay doesn't work with some SASL authentication 
> failures
> ---
>
> Key: KAFKA-10798
> URL: https://issues.apache.org/jira/browse/KAFKA-10798
> Project: Kafka
>  Issue Type: Bug
>  Components: security
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 2.8.0, 2.7.1, 2.6.2
>
>
> KIP-306 introduced the config `connection.failed.authentication.delay.ms` to 
> delay connection closing on brokers for failed authentication to limit the 
> rate of retried authentications from clients in order to avoid excessive 
> authentication load on brokers from failed clients. We rely on authentication 
> failure response to be delayed in this case to prevent clients from detecting 
> the failure and retrying sooner.
> SaslServerAuthenticator delays response for SaslAuthenticationException, but 
> not for SaslException, even though SaslException is also converted into 
> SaslAuthenticationException and processed as an authentication failure by 
> both server and clients. As a result, connection delay is not applied in many 
> scenarios like SCRAM authentication failures.
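
A minimal sketch of the idea; the surrounding broker plumbing is omitted and the helper is hypothetical, but it shows the mapping that would let the configured delay apply to SaslException as well:

{code:java}
import javax.security.sasl.SaslException;
import org.apache.kafka.common.errors.SaslAuthenticationException;

final class SaslFailureHandling {
    // Stand-in for the broker's SASL failure path. Returning a
    // SaslAuthenticationException means the failed-authentication delay
    // (connection.failed.authentication.delay.ms) should be applied before
    // the response is sent and the connection closed.
    static SaslAuthenticationException asAuthenticationFailure(Exception e) {
        if (e instanceof SaslAuthenticationException) {
            return (SaslAuthenticationException) e;
        }
        if (e instanceof SaslException) {
            // Treat SaslException (e.g. SCRAM failures) the same way, so the
            // configured delay applies here too.
            return new SaslAuthenticationException(e.getMessage());
        }
        return null; // not an authentication failure
    }
}
{code}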



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-10798) Failed authentication delay doesn't work with some SASL authentication failures

2020-12-07 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-10798.

  Reviewer: Manikumar
Resolution: Fixed

> Failed authentication delay doesn't work with some SASL authentication 
> failures
> ---
>
> Key: KAFKA-10798
> URL: https://issues.apache.org/jira/browse/KAFKA-10798
> Project: Kafka
>  Issue Type: Bug
>  Components: security
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 2.8.0
>
>
> KIP-306 introduced the config `connection.failed.authentication.delay.ms` to 
> delay connection closing on brokers for failed authentication to limit the 
> rate of retried authentications from clients in order to avoid excessive 
> authentication load on brokers from failed clients. We rely on authentication 
> failure response to be delayed in this case to prevent clients from detecting 
> the failure and retrying sooner.
> SaslServerAuthenticator delays response for SaslAuthenticationException, but 
> not for SaslException, even though SaslException is also converted into 
> SaslAuthenticationException and processed as an authentication failure by 
> both server and clients. As a result, connection delay is not applied in many 
> scenarios like SCRAM authentication failures.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-10554) Perform follower truncation based on epoch offsets returned in Fetch response

2020-12-03 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-10554.

  Reviewer: Jason Gustafson
Resolution: Fixed

> Perform follower truncation based on epoch offsets returned in Fetch response
> -
>
> Key: KAFKA-10554
> URL: https://issues.apache.org/jira/browse/KAFKA-10554
> Project: Kafka
>  Issue Type: Task
>  Components: replication
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 2.8.0
>
>
> KAFKA-10435 updated fetch protocol for KIP-595 to return diverging epoch and 
> offset as part of fetch response. We can use this to truncate logs in 
> followers while processing fetch responses.
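
As a hedged sketch (a stand-in helper, not the actual ReplicaFetcherThread code), the follower would truncate to the smaller of the diverging end offset returned in the fetch response and its own end offset for that epoch, instead of needing a separate OffsetsForLeaderEpoch round trip:

{code:java}
final class DivergenceTruncation {
    /**
     * Compute the truncation offset from the diverging epoch/end offset carried
     * in the fetch response and the follower's own end offset for that epoch.
     * Stand-in helper; the real logic lives in the replica fetcher.
     */
    static long truncationOffset(long divergingEndOffset, long localEndOffsetForEpoch) {
        return Math.min(divergingEndOffset, localEndOffsetForEpoch);
    }
}
{code}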



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-10798) Failed authentication delay doesn't work with some SASL authentication failures

2020-12-02 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-10798:
--

 Summary: Failed authentication delay doesn't work with some SASL 
authentication failures
 Key: KAFKA-10798
 URL: https://issues.apache.org/jira/browse/KAFKA-10798
 Project: Kafka
  Issue Type: Bug
  Components: security
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram
 Fix For: 2.8.0


KIP-306 introduced the config `connection.failed.authentication.delay.ms` to 
delay connection closing on brokers for failed authentication to limit the rate 
of retried authentications from clients in order to avoid excessive 
authentication load on brokers from failed clients. We rely on authentication 
failure response to be delayed in this case to prevent clients from detecting 
the failure and retrying sooner.

SaslServerAuthenticator delays response for SaslAuthenticationException, but 
not for SaslException, even though SaslException is also converted into 
SaslAuthenticationException and processed as an authentication failure by both 
server and clients. As a result, connection delay is not applied in many 
scenarios like SCRAM authentication failures.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-10727) Kafka clients throw AuthenticationException during Kerberos re-login

2020-11-23 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-10727.

Fix Version/s: 2.8.0
 Reviewer: Manikumar
   Resolution: Fixed

> Kafka clients throw AuthenticationException during Kerberos re-login
> 
>
> Key: KAFKA-10727
> URL: https://issues.apache.org/jira/browse/KAFKA-10727
> Project: Kafka
>  Issue Type: Bug
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 2.8.0
>
>
> During Kerberos re-login, we log out and login again. There is a timing issue 
> where the principal in the Subject has been cleared, but a new one hasn't 
> been populated yet. We need to ensure that we don't throw 
> AuthenticationException in this case to avoid Kafka clients 
> (consumer/producer etc.) failing instead of retrying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-10727) Kafka clients throw AuthenticationException during Kerberos re-login

2020-11-16 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-10727:
--

 Summary: Kafka clients throw AuthenticationException during 
Kerberos re-login
 Key: KAFKA-10727
 URL: https://issues.apache.org/jira/browse/KAFKA-10727
 Project: Kafka
  Issue Type: Bug
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram


During Kerberos re-login, we log out and login again. There is a timing issue 
where the principal in the Subject has been cleared, but a new one hasn't been 
populated yet. We need to ensure that we don't throw AuthenticationException in 
this case to avoid Kafka clients (consumer/producer etc.) failing instead of 
retrying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-10700) Support mutual TLS authentication for SASL_SSL listeners

2020-11-09 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-10700:
--

 Summary: Support mutual TLS authentication for SASL_SSL listeners
 Key: KAFKA-10700
 URL: https://issues.apache.org/jira/browse/KAFKA-10700
 Project: Kafka
  Issue Type: New Feature
  Components: security
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram
 Fix For: 2.8.0


See 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-684+-+Support+mutual+TLS+authentication+on+SASL_SSL+listeners
 for details



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-7987) a broker's ZK session may die on transient auth failure

2020-11-04 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-7987.
---
Fix Version/s: 2.8.0
 Reviewer: Jun Rao
   Resolution: Fixed

> a broker's ZK session may die on transient auth failure
> ---
>
> Key: KAFKA-7987
> URL: https://issues.apache.org/jira/browse/KAFKA-7987
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jun Rao
>Priority: Critical
> Fix For: 2.8.0
>
>
> After a transient network issue, we saw the following log in a broker.
> {code:java}
> [23:37:02,102] ERROR SASL authentication with Zookeeper Quorum member failed: 
> javax.security.sasl.SaslException: An error: 
> (java.security.PrivilegedActionException: javax.security.sasl.SaslException: 
> GSS initiate failed [Caused by GSSException: No valid credentials provided 
> (Mechanism level: Server not found in Kerberos database (7))]) occurred when 
> evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client 
> will go to AUTH_FAILED state. (org.apache.zookeeper.ClientCnxn)
> [23:37:02,102] ERROR [ZooKeeperClient] Auth failed. 
> (kafka.zookeeper.ZooKeeperClient)
> {code}
> The network issue prevented the broker from communicating to ZK. The broker's 
> ZK session then expired, but the broker didn't know that yet since it 
> couldn't establish a connection to ZK. When the network was back, the broker 
> tried to establish a connection to ZK, but failed due to auth failure (likely 
> due to a transient KDC issue). The current logic just ignores the auth 
> failure without trying to create a new ZK session. Then the broker will be 
> permanently in a state that it's alive, but not registered in ZK.
>  
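
A sketch of the retry idea using the ZooKeeper Watcher API; the reinitializeSession callback and the 10-second backoff are assumptions for illustration, not the actual Kafka fix:

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

final class AuthFailureRetryWatcher implements Watcher {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final Runnable reinitializeSession; // recreates the ZooKeeper session

    AuthFailureRetryWatcher(Runnable reinitializeSession) {
        this.reinitializeSession = reinitializeSession;
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getState() == Watcher.Event.KeeperState.AuthFailed) {
            // Instead of staying in AUTH_FAILED forever, retry with a new
            // session after a backoff (the value is illustrative).
            scheduler.schedule(reinitializeSession, 10, TimeUnit.SECONDS);
        }
    }
}
{code}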



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-7987) a broker's ZK session may die on transient auth failure

2020-10-27 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221756#comment-17221756
 ] 

Rajini Sivaram commented on KAFKA-7987:
---

[~junrao] Thank you! I will follow up on the PR.

> a broker's ZK session may die on transient auth failure
> ---
>
> Key: KAFKA-7987
> URL: https://issues.apache.org/jira/browse/KAFKA-7987
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jun Rao
>Priority: Critical
>
> After a transient network issue, we saw the following log in a broker.
> {code:java}
> [23:37:02,102] ERROR SASL authentication with Zookeeper Quorum member failed: 
> javax.security.sasl.SaslException: An error: 
> (java.security.PrivilegedActionException: javax.security.sasl.SaslException: 
> GSS initiate failed [Caused by GSSException: No valid credentials provided 
> (Mechanism level: Server not found in Kerberos database (7))]) occurred when 
> evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client 
> will go to AUTH_FAILED state. (org.apache.zookeeper.ClientCnxn)
> [23:37:02,102] ERROR [ZooKeeperClient] Auth failed. 
> (kafka.zookeeper.ZooKeeperClient)
> {code}
> The network issue prevented the broker from communicating to ZK. The broker's 
> ZK session then expired, but the broker didn't know that yet since it 
> couldn't establish a connection to ZK. When the network was back, the broker 
> tried to establish a connection to ZK, but failed due to auth failure (likely 
> due to a transient KDC issue). The current logic just ignores the auth 
> failure without trying to create a new ZK session. Then the broker will be 
> permanently in a state that it's alive, but not registered in ZK.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-7987) a broker's ZK session may die on transient auth failure

2020-10-27 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram updated KAFKA-7987:
--
Issue Type: Bug  (was: Improvement)
  Priority: Critical  (was: Major)

> a broker's ZK session may die on transient auth failure
> ---
>
> Key: KAFKA-7987
> URL: https://issues.apache.org/jira/browse/KAFKA-7987
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jun Rao
>Priority: Critical
>
> After a transient network issue, we saw the following log in a broker.
> {code:java}
> [23:37:02,102] ERROR SASL authentication with Zookeeper Quorum member failed: 
> javax.security.sasl.SaslException: An error: 
> (java.security.PrivilegedActionException: javax.security.sasl.SaslException: 
> GSS initiate failed [Caused by GSSException: No valid credentials provided 
> (Mechanism level: Server not found in Kerberos database (7))]) occurred when 
> evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client 
> will go to AUTH_FAILED state. (org.apache.zookeeper.ClientCnxn)
> [23:37:02,102] ERROR [ZooKeeperClient] Auth failed. 
> (kafka.zookeeper.ZooKeeperClient)
> {code}
> The network issue prevented the broker from communicating to ZK. The broker's 
> ZK session then expired, but the broker didn't know that yet since it 
> couldn't establish a connection to ZK. When the network was back, the broker 
> tried to establish a connection to ZK, but failed due to auth failure (likely 
> due to a transient KDC issue). The current logic just ignores the auth 
> failure without trying to create a new ZK session. Then the broker will be 
> permanently in a state that it's alive, but not registered in ZK.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-7987) a broker's ZK session may die on transient auth failure

2020-10-27 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221328#comment-17221328
 ] 

Rajini Sivaram commented on KAFKA-7987:
---

[~junrao] This is still an open issue for all versions of Kafka, right? I am 
looking into an authorizer issue where authorizer notifications were not 
processed for a long time. Heap dump shows that the authorizer's 
ZookeeperClient is in AUTH_FAILED state. ZK is Kerberos-enabled and there are a 
couple of authentication failures in the logs due to clock-skew errors, which 
look like the reason why the authorizer's ZooKeeperClient got into this state. 
For the authorizer, we do need to schedule retries in this case. But the issue 
doesn't seem to have affected other operations of the broker in this case, 
presumably because we retry connections for the other ZooKeeperClient when 
there are requests. We should still apply the retry fix to the common 
ZooKeeperClient, right? Since the affected broker didn't pick up ACL updates 
made on other brokers, it is a critical security issue. But I wanted to check 
if we have applied any fixes in this area in newer versions of Kafka. Thanks.


> a broker's ZK session may die on transient auth failure
> ---
>
> Key: KAFKA-7987
> URL: https://issues.apache.org/jira/browse/KAFKA-7987
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jun Rao
>Priority: Major
>
> After a transient network issue, we saw the following log in a broker.
> {code:java}
> [23:37:02,102] ERROR SASL authentication with Zookeeper Quorum member failed: 
> javax.security.sasl.SaslException: An error: 
> (java.security.PrivilegedActionException: javax.security.sasl.SaslException: 
> GSS initiate failed [Caused by GSSException: No valid credentials provided 
> (Mechanism level: Server not found in Kerberos database (7))]) occurred when 
> evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client 
> will go to AUTH_FAILED state. (org.apache.zookeeper.ClientCnxn)
> [23:37:02,102] ERROR [ZooKeeperClient] Auth failed. 
> (kafka.zookeeper.ZooKeeperClient)
> {code}
> The network issue prevented the broker from communicating to ZK. The broker's 
> ZK session then expired, but the broker didn't know that yet since it 
> couldn't establish a connection to ZK. When the network was back, the broker 
> tried to establish a connection to ZK, but failed due to auth failure (likely 
> due to a transient KDC issue). The current logic just ignores the auth 
> failure without trying to create a new ZK session. Then the broker will be 
> permanently in a state that it's alive, but not registered in ZK.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-10626) Add support for max.in.flight.requests.per.connection behaviour in MockClient

2020-10-22 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-10626:
--

 Summary: Add support for max.in.flight.requests.per.connection 
behaviour  in MockClient
 Key: KAFKA-10626
 URL: https://issues.apache.org/jira/browse/KAFKA-10626
 Project: Kafka
  Issue Type: Task
  Components: unit tests
Reporter: Rajini Sivaram


We currently don't have an easy way to test 
max.in.flight.requests.per.connection behaviour in unit tests. It will be 
useful to add this for tests like the one in KAFKA-10520 with max.in.flight=1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-10233) KafkaConsumer polls in a tight loop if group is not authorized

2020-10-22 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram updated KAFKA-10233:
---
Fix Version/s: (was: 2.7.0)
   2.8.0

> KafkaConsumer polls in a tight loop if group is not authorized
> --
>
> Key: KAFKA-10233
> URL: https://issues.apache.org/jira/browse/KAFKA-10233
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 2.8.0
>
>
> Consumer propagates GroupAuthorizationException from poll immediately when 
> trying to find coordinator even though it is a retriable exception. If the 
> application polls in a loop, ignoring retriable exceptions, the consumer 
> tries to find coordinator in a tight loop without any backoff. We should 
> apply retry backoff in this case to avoid overloading brokers.
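
A hedged sketch of the backoff bookkeeping; the class and method names are stand-ins for the consumer's coordinator-lookup path, not actual Kafka classes:

{code:java}
final class CoordinatorLookupBackoff {
    private final long retryBackoffMs;
    private long lastFailedLookupMs = -1L;

    CoordinatorLookupBackoff(long retryBackoffMs) {
        this.retryBackoffMs = retryBackoffMs;
    }

    /** Record a failed coordinator lookup (e.g. GROUP_AUTHORIZATION_FAILED). */
    void onLookupFailure(long nowMs) {
        lastFailedLookupMs = nowMs;
    }

    /** Only send the next FindCoordinator once the retry backoff has elapsed. */
    boolean canRetry(long nowMs) {
        return lastFailedLookupMs < 0 || nowMs - lastFailedLookupMs >= retryBackoffMs;
    }
}
{code}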



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-10520) InitProducerId may be blocked if least loaded node is not ready to send

2020-10-07 Thread Rajini Sivaram (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17209737#comment-17209737
 ] 

Rajini Sivaram commented on KAFKA-10520:


[~ableegoldman] Yes, will try and get this done in time for 2.7.0 code freeze.

> InitProducerId may be blocked if least loaded node is not ready to send
> ---
>
> Key: KAFKA-10520
> URL: https://issues.apache.org/jira/browse/KAFKA-10520
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 2.7.0
>
>
> From the logs of a failing producer that shows InitProducerId timing out 
> after request timeout, it looks like we don't poll while waiting for 
> transactional producer to be initialized and FindCoordinator request cannot 
> be sent. The producer configuration used one bootstrap server and 
> `max.in.flight.requests.per.connection=1`. The failing sequence:
>  # Producer sends MetadataRequest to least loaded node (bootstrap server)
>  # Producer is ready to send InitProducerId, needs to find transaction 
> coordinator
>  # Producer creates FindCoordinator request, but the only node known is the 
> bootstrap server. Producer cannot send to this node since there is already 
> the Metadata request in flight and max.inflight is 1.
>  # Producer waits without polling, so Metadata response is not processed. 
> InitProducerId times out eventually.
>   
>  
>  We need to update the condition used to determine whether Sender should 
> poll() to fix this issue.
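
A hedged sketch of the condition change; the flags are stand-ins for the Sender's internal state. The Sender should poll the network client not only when it has in-flight work of its own, but also when a request it wants to send is blocked, so that pending responses (the Metadata response in this sequence) still get processed.

{code:java}
final class SenderPollCondition {
    /**
     * Decide whether the Sender's run loop should call client.poll().
     * Stand-in logic: poll whenever there are in-flight requests to process
     * OR a request we could not send yet (e.g. FindCoordinator blocked by
     * max.in.flight.requests.per.connection=1).
     */
    static boolean shouldPoll(boolean hasInFlightRequests, boolean hasUnsentBlockedRequest) {
        return hasInFlightRequests || hasUnsentBlockedRequest;
    }
}
{code}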



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-10520) InitProducerId may be blocked if least loaded node is not ready to send

2020-10-07 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram reassigned KAFKA-10520:
--

Assignee: Rajini Sivaram

> InitProducerId may be blocked if least loaded node is not ready to send
> ---
>
> Key: KAFKA-10520
> URL: https://issues.apache.org/jira/browse/KAFKA-10520
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 2.7.0
>
>
> From the logs of a failing producer that shows InitProducerId timing out 
> after request timeout, it looks like we don't poll while waiting for 
> transactional producer to be initialized and FindCoordinator request cannot 
> be sent. The producer configuration used one bootstrap server and 
> `max.in.flight.requests.per.connection=1`. The failing sequence:
>  # Producer sends MetadataRequest to least loaded node (bootstrap server)
>  # Producer is ready to send InitProducerId, needs to find transaction 
> coordinator
>  # Producer creates FindCoordinator request, but the only node known is the 
> bootstrap server. Producer cannot send to this node since there is already 
> the Metadata request in flight and max.inflight is 1.
>  # Producer waits without polling, so Metadata response is not processed. 
> InitProducerId times out eventually.
>   
>  
>  We need to update the condition used to determine whether Sender should 
> poll() to fix this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-9497) Brokers start up even if SASL provider is not loaded and throw NPE when clients connect

2020-10-07 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-9497.
---
Resolution: Duplicate

> Brokers start up even if SASL provider is not loaded and throw NPE when 
> clients connect
> ---
>
> Key: KAFKA-9497
> URL: https://issues.apache.org/jira/browse/KAFKA-9497
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.2, 0.11.0.3, 1.1.1, 2.4.0
>Reporter: Rajini Sivaram
>Assignee: Ron Dagostino
>Priority: Major
> Fix For: 2.7.0
>
>
> Note: This is not a regression, this has been the behaviour since SASL was 
> first implemented in Kafka.
>  
> Sasl.createSaslServer and Sasl.createSaslClient may return null if a SASL 
> provider that works for the specified configs cannot be created. We don't 
> currently handle this case. As a result broker/client throws 
> NullPointerException if a provider has not been loaded. On the broker-side, 
> we allow brokers to start up successfully even if SASL provider for its 
> enabled mechanisms are not found. For SASL mechanisms 
> PLAIN/SCRAM-xx/OAUTHBEARER, the login module in Kafka loads the SASL 
> providers. If the login module is incorrectly configured, brokers startup and 
> then fail client connections when hitting NPE. Clients see disconnections 
> during authentication as a result. It is difficult to tell from the client or 
> broker logs why the failure occurred. We should fail during startup if SASL 
> providers are not found and provide better diagnostics for this case.
>  
>  
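
A sketch of the startup check suggested above using the standard javax.security.sasl API (the broker's config plumbing is omitted): Sasl.createSaslServer may return null when no provider supports the mechanism, and that should fail fast with a clear message instead of surfacing later as an NPE.

{code:java}
import java.util.Map;
import javax.security.auth.callback.CallbackHandler;
import javax.security.sasl.Sasl;
import javax.security.sasl.SaslException;
import javax.security.sasl.SaslServer;

final class SaslServerFactoryCheck {
    static SaslServer createOrFail(String mechanism, String protocol, String serverName,
                                   Map<String, ?> props, CallbackHandler handler) throws SaslException {
        SaslServer server = Sasl.createSaslServer(mechanism, protocol, serverName, props, handler);
        if (server == null) {
            // Fail at startup with a descriptive error rather than an NPE at first connection.
            throw new SaslException("No SaslServer provider found for mechanism " + mechanism
                + "; check that the login module and security providers are configured correctly");
        }
        return server;
    }
}
{code}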



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-9497) Brokers start up even if SASL provider is not loaded and throw NPE when clients connect

2020-10-07 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram reassigned KAFKA-9497:
-

Assignee: Ron Dagostino  (was: Rajini Sivaram)

> Brokers start up even if SASL provider is not loaded and throw NPE when 
> clients connect
> ---
>
> Key: KAFKA-9497
> URL: https://issues.apache.org/jira/browse/KAFKA-9497
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.2.2, 0.11.0.3, 1.1.1, 2.4.0
>Reporter: Rajini Sivaram
>Assignee: Ron Dagostino
>Priority: Major
> Fix For: 2.7.0
>
>
> Note: This is not a regression, this has been the behaviour since SASL was 
> first implemented in Kafka.
>  
> Sasl.createSaslServer and Sasl.createSaslClient may return null if a SASL 
> provider that works for the specified configs cannot be created. We don't 
> currently handle this case. As a result broker/client throws 
> NullPointerException if a provider has not been loaded. On the broker-side, 
> we allow brokers to start up successfully even if the SASL provider for its 
> enabled mechanisms is not found. For SASL mechanisms 
> PLAIN/SCRAM-xx/OAUTHBEARER, the login module in Kafka loads the SASL 
> providers. If the login module is incorrectly configured, brokers start up and 
> then fail client connections with the NPE. Clients see disconnections 
> during authentication as a result. It is difficult to tell from the client or 
> broker logs why the failure occurred. We should fail during startup if SASL 
> providers are not found and provide better diagnostics for this case.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-10338) Support PEM format for SSL certificates and private key

2020-10-07 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram updated KAFKA-10338:
---
Description: 
We currently support only file-based JKS/PKCS12 format for SSL key stores and 
trust stores. It will be good to add support for PEM as configuration values 
that fits better with config externalization.

KIP: 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-651+-+Support+PEM+format+for+SSL+certificates+and+private+key

  was:We currently support only file-based JKS/PKCS12 format for SSL key stores 
and trust stores. It will be good to add support for PEM as configuration 
values that fits better with config externalization.


> Support PEM format for SSL certificates and private key
> ---
>
> Key: KAFKA-10338
> URL: https://issues.apache.org/jira/browse/KAFKA-10338
> Project: Kafka
>  Issue Type: New Feature
>  Components: security
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 2.7.0
>
>
> We currently support only file-based JKS/PKCS12 format for SSL key stores and 
> trust stores. It will be good to add support for PEM as configuration values 
> that fits better with config externalization.
> KIP: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-651+-+Support+PEM+format+for+SSL+certificates+and+private+key
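
For illustration, a client-side configuration sketch in the PEM-in-config style that KIP-651 describes (key names per the KIP; the inline PEM bodies are placeholders):

{code:java}
import java.util.Properties;

final class PemSslConfigExample {
    static Properties sslProps() {
        Properties props = new Properties();
        props.put("security.protocol", "SSL");
        // PEM content supplied directly as config values (no keystore files),
        // which fits well with config providers / externalized secrets.
        props.put("ssl.keystore.type", "PEM");
        props.put("ssl.keystore.certificate.chain", "-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----");
        props.put("ssl.keystore.key", "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----");
        props.put("ssl.truststore.type", "PEM");
        props.put("ssl.truststore.certificates", "-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----");
        return props;
    }
}
{code}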



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-10338) Support PEM format for SSL certificates and private key

2020-10-06 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-10338.

Fix Version/s: 2.7.0
 Reviewer: Manikumar
   Resolution: Fixed

> Support PEM format for SSL certificates and private key
> ---
>
> Key: KAFKA-10338
> URL: https://issues.apache.org/jira/browse/KAFKA-10338
> Project: Kafka
>  Issue Type: New Feature
>  Components: security
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 2.7.0
>
>
> We currently support only file-based JKS/PKCS12 format for SSL key stores and 
> trust stores. It will be good to add support for PEM as configuration values 
> that fits better with config externalization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-10554) Perform follower truncation based on epoch offsets returned in Fetch response

2020-09-30 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-10554:
--

 Summary: Perform follower truncation based on epoch offsets 
returned in Fetch response
 Key: KAFKA-10554
 URL: https://issues.apache.org/jira/browse/KAFKA-10554
 Project: Kafka
  Issue Type: Task
  Components: replication
Reporter: Rajini Sivaram
Assignee: Rajini Sivaram
 Fix For: 2.7.0


KAFKA-10435 updated fetch protocol for KIP-595 to return diverging epoch and 
offset as part of fetch response. We can use this to truncate logs in followers 
while processing fetch responses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-10520) InitProducerId may be blocked if least loaded node is not ready to send

2020-09-24 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram updated KAFKA-10520:
---
Description: 
From the logs of a failing producer that shows InitProducerId timing out after 
request timeout, it looks like we don't poll while waiting for transactional 
producer to be initialized and FindCoordinator request cannot be sent. The 
producer configuration used one bootstrap server and 
`max.in.flight.requests.per.connection=1`. The failing sequence:

 # Producer sends MetadataRequest to least loaded node (bootstrap server)
 # Producer is ready to send InitProducerId, needs to find transaction 
coordinator
 # Producer creates FindCoordinator request, but the only node known is the 
bootstrap server. Producer cannot send to this node since there is already the 
Metadata request in flight and max.inflight is 1.
 # Producer waits without polling, so Metadata response is not processed. 
InitProducerId times out eventually.
  
 
 We need to update the condition used to determine whether Sender should poll() 
to fix this issue.

  was:
From the logs of a failing producer that shows InitProducerId timing out after 
request timeout, it looks like we don't poll while waiting for transactional 
producer to be initialized and FindCoordinator request cannot be sent. The 
producer configuration used one bootstrap server and 
`max.in.flight.requests.per.connection=1`. The failing sequence:
 # Producer sends MetadataRequest to least loaded node (bootstrap server)
 # Producer is ready to send InitProducerId, needs to find transaction coordinator
 # Producer creates FindCoordinator request, but the only node known is the bootstrap server. Producer cannot send to this node since there is already the Metadata request in flight and max.inflight is 1.
 # Producer waits without polling, so Metadata response is not processed. InitProducerId times out eventually.

We need to update the condition used to determine whether Sender should poll() to fix this issue.


> InitProducerId may be blocked if least loaded node is not ready to send
> ---
>
> Key: KAFKA-10520
> URL: https://issues.apache.org/jira/browse/KAFKA-10520
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Reporter: Rajini Sivaram
>Priority: Major
> Fix For: 2.7.0
>
>
> From the logs of a failing producer that shows InitProducerId timing out 
> after request timeout, it looks like we don't poll while waiting for 
> transactional producer to be initialized and FindCoordinator request cannot 
> be sent. The producer configuration used one bootstrap server and 
> `max.in.flight.requests.per.connection=1`. The failing sequence:
>  # Producer sends MetadataRequest to least loaded node (bootstrap server)
>  # Producer is ready to send InitProducerId, needs to find transaction 
> coordinator
>  # Producer creates FindCoordinator request, but the only node known is the 
> bootstrap server. Producer cannot send to this node since there is already 
> the Metadata request in flight and max.inflight is 1.
>  # Producer waits without polling, so Metadata response is not processed. 
> InitProducerId times out eventually.
>   
>  
>  We need to update the condition used to determine whether Sender should 
> poll() to fix this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-10520) InitProducerId may be blocked if least loaded node is not ready to send

2020-09-24 Thread Rajini Sivaram (Jira)
Rajini Sivaram created KAFKA-10520:
--

 Summary: InitProducerId may be blocked if least loaded node is not 
ready to send
 Key: KAFKA-10520
 URL: https://issues.apache.org/jira/browse/KAFKA-10520
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Reporter: Rajini Sivaram
 Fix For: 2.7.0


From the logs of a failing producer that shows InitProducerId timing out after 
request timeout, it looks like we don't poll while waiting for transactional 
producer to be initialized and FindCoordinator request cannot be sent. The 
producer configuration used one bootstrap server and 
`max.in.flight.requests.per.connection=1`. The failing sequence:
 # Producer sends MetadataRequest to least loaded node (bootstrap server)
 # Producer is ready to send InitProducerId, needs to find transaction coordinator
 # Producer creates FindCoordinator request, but the only node known is the bootstrap server. Producer cannot send to this node since there is already the Metadata request in flight and max.inflight is 1.
 # Producer waits without polling, so Metadata response is not processed. InitProducerId times out eventually.

We need to update the condition used to determine whether Sender should poll() to fix this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

