[jira] [Created] (KAFKA-6464) Base64URL encoding under JRE 1.7 is broken due to incorrect padding assumption
Ron Dagostino created KAFKA-6464: Summary: Base64URL encoding under JRE 1.7 is broken due to incorrect padding assumption Key: KAFKA-6464 URL: https://issues.apache.org/jira/browse/KAFKA-6464 Project: Kafka Issue Type: Bug Components: clients Affects Versions: 1.0.0 Reporter: Ron Dagostino

The org.apache.kafka.common.utils.Base64 class defers Base64 encoding/decoding to the java.util.Base64 class beginning with JRE 1.8 but leverages javax.xml.bind.DatatypeConverter under JRE 1.7. The implementation of the encodeToString(byte[]) method returned under JRE 1.7 by Base64.urlEncoderNoPadding() blindly removes the last two trailing characters of the Base64 encoding under the assumption that they will always be the string "==", but that is incorrect: padding can be "=", "==", or non-existent. For example, this statement:

{code:java}
Base64.urlEncoderNoPadding().encodeToString(
    "{\"alg\":\"none\"}".getBytes(StandardCharsets.UTF_8));
{code}

yields {{eyJhbGciOiJub25lIn}}, which is incorrect: because the padding on the Base64-encoded value is "=" rather than the assumed "==", an extra character is trimmed. The correct value is {{eyJhbGciOiJub25lIn0}}.

There is also no Base64.urlDecoder() method, which, aside from providing useful functionality, would also make it easy to write a unit test (there currently is none).

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
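A minimal sketch of a corrected JRE 1.7 code path (illustrative only, not the project's actual patch): strip however many trailing '=' characters are present instead of always removing two, and map the standard Base64 alphabet to the URL-safe one.

{code:java}
import javax.xml.bind.DatatypeConverter;

public final class UrlSafeBase64 {
    // Encode to Base64URL with no padding, tolerating "", "=", or "==" padding
    public static String encodeToStringNoPadding(byte[] bytes) {
        String base64 = DatatypeConverter.printBase64Binary(bytes)
                .replace('+', '-')   // Base64 -> Base64URL alphabet
                .replace('/', '_');
        int end = base64.length();
        while (end > 0 && base64.charAt(end - 1) == '=')
            end--;                   // strip 0, 1, or 2 padding characters
        return base64.substring(0, end);
    }
}
{code}

For the example above, this yields {{eyJhbGciOiJub25lIn0}} as expected.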
[jira] [Created] (KAFKA-6562) KIP-255: OAuth Authentication via SASL/OAUTHBEARER
Ron Dagostino created KAFKA-6562: Summary: KIP-255: OAuth Authentication via SASL/OAUTHBEARER Key: KAFKA-6562 URL: https://issues.apache.org/jira/browse/KAFKA-6562 Project: Kafka Issue Type: Improvement Components: clients Reporter: Ron Dagostino KIP-255: OAuth Authentication via SASL/OAUTHBEARER (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75968876) proposes adding the ability to authenticate to Kafka with OAuth 2 bearer tokens using the OAUTHBEARER SASL mechanism. Token retrieval and token validation are both pluggable. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-7231) NetworkClient.newClientRequest() ignores custom request timeout in favor of the default
Ron Dagostino created KAFKA-7231: Summary: NetworkClient.newClientRequest() ignores custom request timeout in favor of the default Key: KAFKA-7231 URL: https://issues.apache.org/jira/browse/KAFKA-7231 Project: Kafka Issue Type: Bug Components: clients Affects Versions: 2.0.0 Reporter: Ron Dagostino

The below code in {{org.apache.kafka.clients.NetworkClient}} (the {{KafkaClient}} implementation) is not passing in the provided {{requestTimeoutMs}} -- it is ignoring it in favor of the {{defaultRequestTimeoutMs}} value.

{code:java}
@Override
public ClientRequest newClientRequest(String nodeId, AbstractRequest.Builder<?> requestBuilder,
                                      long createdTimeMs, boolean expectResponse,
                                      int requestTimeoutMs, RequestCompletionHandler callback) {
    return new ClientRequest(nodeId, requestBuilder, correlation++, clientId, createdTimeMs,
            expectResponse, defaultRequestTimeoutMs, callback);
}
{code}

This is an easy fix, but the impact of fixing it is difficult to quantify. Clients that set a custom timeout are getting the default timeout of 1000 ms -- fixing this will suddenly cause the custom timeout to take effect.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
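The fix the report implies is a one-token change: pass the caller's {{requestTimeoutMs}} through. A sketch (not necessarily the merged patch):

{code:java}
@Override
public ClientRequest newClientRequest(String nodeId, AbstractRequest.Builder<?> requestBuilder,
                                      long createdTimeMs, boolean expectResponse,
                                      int requestTimeoutMs, RequestCompletionHandler callback) {
    return new ClientRequest(nodeId, requestBuilder, correlation++, clientId, createdTimeMs,
            expectResponse, requestTimeoutMs, callback); // was defaultRequestTimeoutMs
}
{code}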
[jira] [Created] (KAFKA-7324) NPE due to lack of SASLExtensions in SASL/OAUTHBEARER
Ron Dagostino created KAFKA-7324: Summary: NPE due to lack of SASLExtensions in SASL/OAUTHBEARER Key: KAFKA-7324 URL: https://issues.apache.org/jira/browse/KAFKA-7324 Project: Kafka Issue Type: Bug Components: clients Affects Versions: 2.0.1 Reporter: Ron Dagostino Assignee: Ron Dagostino Fix For: 2.0.1

When there are no SASL extensions in an OAUTHBEARER request (or the callback handler does not support SaslExtensionsCallback) the OAuthBearerSaslClient.retrieveCustomExtensions() method returns null. This null value is then passed to the OAuthBearerClientInitialResponse constructor, and that results in an NPE:

{noformat}
java.lang.NullPointerException
	at org.apache.kafka.common.security.oauthbearer.internals.OAuthBearerClientInitialResponse.validateExtensions(OAuthBearerClientInitialResponse.java:115)
	at org.apache.kafka.common.security.oauthbearer.internals.OAuthBearerClientInitialResponse.<init>(OAuthBearerClientInitialResponse.java:81)
	at org.apache.kafka.common.security.oauthbearer.internals.OAuthBearerClientInitialResponse.<init>(OAuthBearerClientInitialResponse.java:75)
	at org.apache.kafka.common.security.oauthbearer.internals.OAuthBearerSaslClient.evaluateChallenge(OAuthBearerSaslClient.java:99)
{noformat}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
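A minimal sketch of the kind of guard that avoids the NPE (variable names mirror the description above, not necessarily the merged code): substitute an empty {{SaslExtensions}} when none are retrieved.

{code:java}
SaslExtensions extensions = retrieveCustomExtensions();
if (extensions == null)
    extensions = new SaslExtensions(java.util.Collections.<String, String>emptyMap());
// extensions is now safe to pass to the OAuthBearerClientInitialResponse constructor
{code}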
[jira] [Created] (KAFKA-7352) KIP-368: Allow SASL Connections to Periodically Re-Authenticate
Ron Dagostino created KAFKA-7352: Summary: KIP-368: Allow SASL Connections to Periodically Re-Authenticate Key: KAFKA-7352 URL: https://issues.apache.org/jira/browse/KAFKA-7352 Project: Kafka Issue Type: Improvement Components: clients, core Reporter: Ron Dagostino Assignee: Ron Dagostino

KIP-368: Allow SASL Connections to Periodically Re-Authenticate

The adoption of KIP-255: OAuth Authentication via SASL/OAUTHBEARER in release 2.0.0 creates the possibility of using information in the bearer token to make authorization decisions. Unfortunately, however, Kafka connections are long-lived, so there is no ability to change the bearer token associated with a particular connection. Allowing SASL connections to periodically re-authenticate would resolve this.

In addition to this motivation there are two others that are security-related. First, to eliminate access to Kafka the current requirement is to remove all authorizations (i.e. remove all ACLs). This is necessary because of the long-lived nature of the connections. It is operationally simpler to shut off access at the point of authentication, and with the release of KIP-86: Configurable SASL Callback Handlers it is going to become more and more likely that installations will authenticate users against external directories (e.g. via LDAP). The ability to stop Kafka access by simply disabling an account in an LDAP directory (for example) is desirable.

The second motivating factor for re-authentication related to security is that the use of short-lived tokens is a common OAuth security recommendation, but issuing a short-lived token to a Kafka client (or a broker when OAUTHBEARER is the inter-broker protocol) currently has no benefit because once a client is connected to a broker the client is never challenged again and the connection may remain intact beyond the token expiration time (and may remain intact indefinitely under perfect circumstances).

This KIP proposes adding the ability for clients (and brokers when OAUTHBEARER is the inter-broker protocol) to re-authenticate their connections to brokers and have the new bearer token appear on their session rather than the old one.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-7182) SASL/OAUTHBEARER client response is missing %x01 separators
Ron Dagostino created KAFKA-7182: Summary: SASL/OAUTHBEARER client response is missing %x01 separators Key: KAFKA-7182 URL: https://issues.apache.org/jira/browse/KAFKA-7182 Project: Kafka Issue Type: Bug Components: clients Affects Versions: 2.0.0 Reporter: Ron Dagostino Assignee: Ron Dagostino

The format of the SASL/OAUTHBEARER client response is defined in [RFC 7628 Section 3.1|https://tools.ietf.org/html/rfc7628#section-3.1] as follows:

{noformat}
kvsep          = %x01
key            = 1*(ALPHA)
value          = *(VCHAR / SP / HTAB / CR / LF )
kvpair         = key "=" value kvsep
client-resp    = (gs2-header kvsep *kvpair kvsep) / kvsep
;;gs2-header   = See [RFC 5801 (Section 4)|https://tools.ietf.org/html/rfc5801#section-4]
{noformat}

The SASL/OAUTHBEARER client response as currently implemented in OAuthBearerSaslClient sends the valid gs2-header "n,," but then sends the "auth" key and value immediately after it, like this:

{code:java}
String.format("n,,auth=Bearer %s", callback.token().value())
{code}

This does not conform to the specification because there is no %x01 after the gs2-header, no %x01 after the auth value, and no terminating %x01. The code should instead be as follows:

{code:java}
String.format("n,,\u0001auth=Bearer %s\u0001\u0001", callback.token().value())
{code}

Similarly, the parsing of the client response in OAuthBearerSaslServer, which currently allows the malformed text, must also change.

*This should be fixed prior to the initial release of the SASL/OAUTHBEARER code in 2.0.0 to prevent compatibility problems.*

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-6664) KIP-269 Substitution Within Configuration Values
Ron Dagostino created KAFKA-6664: Summary: KIP-269 Substitution Within Configuration Values Key: KAFKA-6664 URL: https://issues.apache.org/jira/browse/KAFKA-6664 Project: Kafka Issue Type: Improvement Components: clients Reporter: Ron Dagostino KIP-269 (Substitution Within Configuration Values) proposes adding support for substitution within client JAAS configuration values for PLAIN and SCRAM-related SASL mechanisms in a backwards-compatible manner and making the functionality available to other existing (or future) configuration contexts where it is deemed appropriate. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (KAFKA-7960) KIP-432: Additional Broker-Side Opt-In for Default, Unsecure SASL/OAUTHBEARER Implementation
Ron Dagostino created KAFKA-7960: Summary: KIP-432: Additional Broker-Side Opt-In for Default, Unsecure SASL/OAUTHBEARER Implementation Key: KAFKA-7960 URL: https://issues.apache.org/jira/browse/KAFKA-7960 Project: Kafka Issue Type: Improvement Components: clients Affects Versions: 2.1.1, 2.1.0, 2.0.1, 2.0.0, 2.2.0, 2.1.2 Reporter: Ron Dagostino

The default implementation of SASL/OAUTHBEARER, as per KIP-255 (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75968876), is unsecured. This is useful for development and testing purposes, and it provides a great out-of-the-box experience, but it must not be used in production because it allows the client to authenticate with any principal name it wishes.

To enable the default unsecured SASL/OAUTHBEARER implementation on the broker side simply requires the addition of OAUTHBEARER to the sasl.enabled.mechanisms configuration value (for example: sasl.enabled.mechanisms=GSSAPI,OAUTHBEARER instead of simply sasl.enabled.mechanisms=GSSAPI). To secure the implementation requires the explicit setting of the listener.name.{sasl_plaintext|sasl_ssl}.oauthbearer.sasl.{login,server}.callback.handler.class properties on the broker.

The question then arises: what if someone either accidentally or maliciously appended OAUTHBEARER to the sasl.enabled.mechanisms configuration value? Doing so would enable the unsecured implementation on the broker, and clients could then authenticate with any principal name they desired.

KIP-432 (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=103091238) proposes to add an additional opt-in configuration property on the broker side for the default, unsecured SASL/OAUTHBEARER implementation such that simply adding OAUTHBEARER to the sasl.enabled.mechanisms configuration value would be insufficient to enable the feature. This additional opt-in broker configuration property would have to be explicitly set to true before the default unsecured implementation would successfully authenticate users, and the name of this configuration property would explicitly indicate that the feature is not secure and must not be used in production.

Adding this explicit opt-in is a breaking change; existing uses of the unsecured implementation would have to update their configuration to include this explicit opt-in property before their cluster would accept unsecured tokens again. Note that this would only result in a breaking change in production if the unsecured feature is either accidentally or maliciously enabled there; it is assumed that 1) this will probably not happen to anyone; and 2) if it does happen to someone it almost certainly would not impact sanctioned clients but would instead impact malicious clients only (if there were any).

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
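For reference, a sketch of the broker properties involved (the callback handler class names below are placeholders, and the exact name of the proposed opt-in property is defined by KIP-432, not shown here -- consult the KIP for the final name):

{noformat}
# enabling the mechanism alone is what KIP-432 argues should be insufficient:
sasl.enabled.mechanisms=GSSAPI,OAUTHBEARER

# securing the implementation requires explicit callback handler classes:
listener.name.sasl_ssl.oauthbearer.sasl.login.callback.handler.class=com.example.MyLoginCallbackHandler
listener.name.sasl_ssl.oauthbearer.sasl.server.callback.handler.class=com.example.MyValidatorCallbackHandler
{noformat}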
[jira] [Created] (KAFKA-7902) SASL/OAUTHBEARER can become unable to connect: javax.security.sasl.SaslException: Unable to find OAuth Bearer token in Subject's private credentials (size=2)
Ron Dagostino created KAFKA-7902: Summary: SASL/OAUTHBEARER can become unable to connect: javax.security.sasl.SaslException: Unable to find OAuth Bearer token in Subject's private credentials (size=2) Key: KAFKA-7902 URL: https://issues.apache.org/jira/browse/KAFKA-7902 Project: Kafka Issue Type: Bug Components: clients Affects Versions: 2.1.0, 2.0.1, 2.0.0 Reporter: Ron Dagostino Assignee: Ron Dagostino

It is possible for a Java SASL/OAUTHBEARER client (either a non-broker producer/consumer client or a broker when acting as an inter-broker client) to end up in a state where it cannot connect to a new broker (or, if re-authentication as implemented by KIP-368 and merged for v2.2.0 were to be deployed and enabled, to be unable to re-authenticate). The error message looks like this:

{{Connection to node 1 failed authentication due to: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: Unable to find OAuth Bearer token in Subject's private credentials (size=2) [Caused by java.io.IOException: Unable to find OAuth Bearer token in Subject's private credentials (size=2)]) occurred when evaluating SASL token received from the Kafka Broker. Kafka Client will go to AUTHENTICATION_FAILED state.}}

The root cause of the problem begins at this point in the code: [https://github.com/apache/kafka/blob/2.0/clients/src/main/java/org/apache/kafka/common/security/oauthbearer/internals/expiring/ExpiringCredentialRefreshingLogin.java#L378]. The {{loginContext}} field doesn't get replaced with the old version stored away in the {{optionalLoginContextToLogout}} variable if/when the {{loginContext.login()}} call on line 381 throws an exception.

*This is an unusual event* -- the OAuth authorization server must be unavailable at the moment when the token refresh occurs -- but when it does happen it puts the refresher thread instance in an invalid state, because now its {{loginContext}} field represents the one that failed instead of the original one, which is now lost. The current {{loginContext}} can't be logged out -- it will throw an {{InvalidStateException}} if that is attempted because there is no token associated with it -- and the token associated with the login context that was lost can never be logged out and removed from the Subject's private credentials (because we don't retain a reference to it). The net effect is that we end up with an extra token on the Subject's private credentials, which eventually results in the exception mentioned above when the client tries to authenticate to a broker.

So the chain of events is:

1) Login failure upon token refresh causes the refresher thread's login context field to be incorrect, and the existing token on the Subject's private credentials will never be logged out/removed.
2) A retry occurs in 10 seconds, potentially repeatedly until the authorization server is back online.
3) Login succeeds, adding a second token to the Subject's private credentials. (Logout is then called on the login context set incorrectly in the most recent failure -- e.g. in step 1 -- which results in an exception, but this is not the real issue; it is the 2 tokens on the Subject's private credentials that are the issue.)
4) At this point we now have 2 tokens on the Subject, and when the client later tries to make a new connection, it sees the 2 tokens and throws an exception -- BOOM! The client is now unable to connect (or re-authenticate if applicable) going forward.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
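A minimal sketch of the repair the description implies (field and variable names are taken from the report above; this is not necessarily the merged patch): restore the saved login context when the re-login fails, so the original token can still be logged out later.

{code:java}
try {
    loginContext.login();
} catch (LoginException e) {
    // put the original, still-valid context back; otherwise the Subject
    // accumulates a second token that can never be logged out
    if (optionalLoginContextToLogout.isPresent())
        loginContext = optionalLoginContextToLogout.get();
    throw e;
}
{code}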
[jira] [Created] (KAFKA-9284) Add documentation and system tests for TLS-encrypted Zookeeper connections
Ron Dagostino created KAFKA-9284: Summary: Add documentation and system tests for TLS-encrypted Zookeeper connections Key: KAFKA-9284 URL: https://issues.apache.org/jira/browse/KAFKA-9284 Project: Kafka Issue Type: Improvement Components: documentation, system tests Affects Versions: 2.4.0 Reporter: Ron Dagostino Assignee: Ron Dagostino TLS connectivity to ZooKeeper became available in the 3.5.x versions. Now with the inclusion of these ZooKeeper versions, Kafka should supply documentation that distills the steps required to take advantage of TLS and include system tests to validate such setups. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9241) SASL Clients are not forced to re-authenticate if they don't leverage SaslAuthenticateRequest
Ron Dagostino created KAFKA-9241: Summary: SASL Clients are not forced to re-authenticate if they don't leverage SaslAuthenticateRequest Key: KAFKA-9241 URL: https://issues.apache.org/jira/browse/KAFKA-9241 Project: Kafka Issue Type: Bug Components: clients Affects Versions: 2.2.1, 2.3.0, 2.2.0 Reporter: Ron Dagostino Assignee: Ron Dagostino Brokers are supposed to force SASL clients to re-authenticate (and kill such connections in the absence of a timely and successful re-authentication) when SASL Re-Authentication [(KIP-368)|https://cwiki.apache.org/confluence/display/KAFKA/KIP-368%3A+Allow+SASL+Connections+to+Periodically+Re-Authenticate] is enabled via a positive `connections.max.reauth.ms` configuration value. There is a flaw in the logic that causes connections to not be killed in the absence of a timely and successful re-authentication _if the client does not leverage the SaslAuthenticateRequest API_ (which was defined in [KIP-152|https://cwiki.apache.org/confluence/display/KAFKA/KIP-152+-+Improve+diagnostics+for+SASL+authentication+failures]). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9469) Add zookeeper.ssl.context.supplier.class config if/when adopting ZooKeeper 3.6
Ron Dagostino created KAFKA-9469: Summary: Add zookeeper.ssl.context.supplier.class config if/when adopting ZooKeeper 3.6 Key: KAFKA-9469 URL: https://issues.apache.org/jira/browse/KAFKA-9469 Project: Kafka Issue Type: New Feature Components: config Reporter: Ron Dagostino Assignee: Ron Dagostino The "zookeeper.ssl.context.supplier.class" configuration doesn't actually exist in ZooKeeper 3.5.6. The ZooKeeper admin guide documents it as being there, but it doesn't appear in the code. This means we can't support it in KIP-515, and it has been removed from that KIP. I checked the latest ZooKeeper 3.6 SNAPSHOT, and it has been added. So this config could probably be added to Kafka via a new, small KIP if/when we upgrade to ZooKeeper 3.6 (which looks to be in release-candidate stage at the moment). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9575) "Notable changes in 2.5.0" doesn't mention ZooKeeper 3.5.7
Ron Dagostino created KAFKA-9575: Summary: "Notable changes in 2.5.0" doesn't mention ZooKeeper 3.5.7 Key: KAFKA-9575 URL: https://issues.apache.org/jira/browse/KAFKA-9575 Project: Kafka Issue Type: Improvement Components: docs, documentation Affects Versions: 2.5.0 Reporter: Ron Dagostino Assignee: Ron Dagostino Fix For: 2.5.0 There is a paragraph in the 2.4.0 upgrade notes talking about ZooKeeper bugs that make manual intervention recommended while upgrading from ZooKeeper 3.4. Both of the ZooKeeper bugs that are mentioned in the paragraph are fixed in 3.5.7, so at a minimum we should mention the ZooKeeper 3.5.7 upgrade in the AK 2.5.0 upgrade notes section. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-9284) Add documentation and system tests for TLS-encrypted Zookeeper connections
[ https://issues.apache.org/jira/browse/KAFKA-9284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ron Dagostino resolved KAFKA-9284.
----------------------------------
    Fix Version/s: 2.5.0
       Resolution: Duplicate

Duplicate

> Add documentation and system tests for TLS-encrypted Zookeeper connections
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-9284
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9284
>             Project: Kafka
>          Issue Type: Improvement
>          Components: documentation, system tests
>    Affects Versions: 2.4.0
>            Reporter: Ron Dagostino
>            Assignee: Ron Dagostino
>            Priority: Minor
>             Fix For: 2.5.0
>
>
> TLS connectivity to Zookeeper became available in the 3.5.x versions. Now
> with the inclusion of these Zookeeper versions Kafka should supply
> documentation that distills the steps required to take advantage of TLS and
> include system tests to validate such setups.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9567) Docs and system tests for ZooKeeper 3.5.7 and KIP-515
Ron Dagostino created KAFKA-9567: Summary: Docs and system tests for ZooKeeper 3.5.7 and KIP-515 Key: KAFKA-9567 URL: https://issues.apache.org/jira/browse/KAFKA-9567 Project: Kafka Issue Type: Improvement Affects Versions: 2.5.0 Reporter: Ron Dagostino These changes depend on [KIP-515: Enable ZK client to use the new TLS supported authentication|https://cwiki.apache.org/confluence/display/KAFKA/KIP-515%3A+Enable+ZK+client+to+use+the+new+TLS+supported+authentication], which was only added to 2.5.0. The upgrade to ZooKeeper 3.5.7 was merged to both 2.5.0 and 2.4.1 via https://issues.apache.org/jira/browse/KAFKA-9515, but this change must only be merged to 2.5.0 (it will break the system tests if merged to 2.4.1). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-10213) Prefer --bootstrap-server in ducktape tests for Kafka clients
[ https://issues.apache.org/jira/browse/KAFKA-10213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ron Dagostino resolved KAFKA-10213.
-----------------------------------
    Resolution: Fixed

> Prefer --bootstrap-server in ducktape tests for Kafka clients
> --------------------------------------------------------------
>
>                 Key: KAFKA-10213
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10213
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Vinoth Chandar
>            Assignee: Ron Dagostino
>            Priority: Major
>

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-10258) Get rid of use_zk_connection flag in kafka.py public methods
[ https://issues.apache.org/jira/browse/KAFKA-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ron Dagostino resolved KAFKA-10258.
-----------------------------------
    Resolution: Fixed

> Get rid of use_zk_connection flag in kafka.py public methods
> -------------------------------------------------------------
>
>                 Key: KAFKA-10258
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10258
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Vinoth Chandar
>            Assignee: Ron Dagostino
>            Priority: Major
>

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-10131) Minimize use of --zookeeper flag in ducktape tests
[ https://issues.apache.org/jira/browse/KAFKA-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ron Dagostino resolved KAFKA-10131.
-----------------------------------
    Fix Version/s: 2.7.0
         Reviewer: Colin McCabe
       Resolution: Fixed

PR: https://github.com/apache/kafka/pull/9274

> Minimize use of --zookeeper flag in ducktape tests
> ---------------------------------------------------
>
>                 Key: KAFKA-10131
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10131
>             Project: Kafka
>          Issue Type: Improvement
>          Components: system tests
>            Reporter: Vinoth Chandar
>            Assignee: Ron Dagostino
>            Priority: Major
>             Fix For: 2.7.0
>
>
> Get the ducktape tests working without the --zookeeper flag (except for scram).
> (Note: When doing compat testing we'll still use the old flags.)
> Below are the current usages
> {code:java}
> [tests]$ grep -R -e "--zookeeper" .
> ./kafkatest/tests/core/zookeeper_tls_encrypt_only_test.py:# Cannot use --zookeeper because kafka-topics.sh is unable to connect to a TLS-enabled ZooKeeper quorum,
> ./kafkatest/tests/client/quota_test.py:cmd = "%s --zookeeper %s --alter --add-config producer_byte_rate=%d,consumer_byte_rate=%d" % \
> ./kafkatest/services/console_consumer.py:cmd += " --zookeeper %(zk_connect)s" % args
> ./kafkatest/services/security/security_config.py:cmd = "%s --zookeeper %s --entity-name %s --entity-type users --alter --add-config %s=[password=%s]" % \
> ./kafkatest/services/zookeeper.py:la_migra_cmd += "%s --zookeeper.acl=%s --zookeeper.connect=%s %s" % \
> ./kafkatest/services/zookeeper.py:cmd = "%s kafka.admin.ConfigCommand --zookeeper %s %s --describe --topic %s" % \
> # Used by MessageFormatChangeTest, TruncationTest
> ./kafkatest/services/kafka/kafka.py:cmd += "%s --zookeeper %s %s --entity-name %s --entity-type topics --alter --add-config message.format.version=%s" % \
> ./kafkatest/services/kafka/kafka.py:cmd += "%s --zookeeper %s %s --entity-name %s --entity-type topics --alter --add-config unclean.leader.election.enable=%s" % \
> # called by reassign_partitions.sh, ThrottlingTest, ReassignPartitionsTest
> ./kafkatest/services/kafka/kafka.py:cmd += "--zookeeper %s " % self.zk_connect_setting()
> ./kafkatest/services/kafka/kafka.py:cmd += "--zookeeper %s " % self.zk_connect_setting()
> ./kafkatest/services/kafka/kafka.py:connection_setting = "--zookeeper %s" % (self.zk_connect_setting())
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-6664) KIP-269 Substitution Within Configuration Values
[ https://issues.apache.org/jira/browse/KAFKA-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ron Dagostino resolved KAFKA-6664.
----------------------------------
    Resolution: Won't Do

KIP was not accepted

> KIP-269 Substitution Within Configuration Values
> -------------------------------------------------
>
>                 Key: KAFKA-6664
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6664
>             Project: Kafka
>          Issue Type: Improvement
>          Components: clients
>            Reporter: Ron Dagostino
>            Assignee: Ron Dagostino
>            Priority: Major
>
>
> KIP-269 (Substitution Within Configuration Values) proposes adding support
> for substitution within client JAAS configuration values for PLAIN and
> SCRAM-related SASL mechanisms in a backwards-compatible manner and making the
> functionality available to other existing (or future) configuration contexts
> where it is deemed appropriate.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-10418) Incomplete error/docs when altering topic configs via kafka-topics with --bootstrap-server
[ https://issues.apache.org/jira/browse/KAFKA-10418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ron Dagostino resolved KAFKA-10418.
-----------------------------------
    Resolution: Fixed

PR: https://github.com/apache/kafka/pull/9199

> Incomplete error/docs when altering topic configs via kafka-topics with --bootstrap-server
> -------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-10418
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10418
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Ron Dagostino
>            Assignee: Ron Dagostino
>            Priority: Minor
>             Fix For: 2.7.0
>
>
> Changing a topic config with the kafka-topics command while connecting to
> Kafka via --bootstrap-server (rather than connecting to ZooKeeper via
> --zookeeper) is not supported. The desired functionality is available
> elsewhere, though: it is possible to change a topic config while connecting
> to Kafka rather than ZooKeeper via the kafka-configs command instead.
> However, neither the kafka-topics error message received nor the kafka-topics
> help information itself indicates this other possibility. For example:
>
> {{$ kafka-topics.sh --bootstrap-server localhost:9092 --alter --topic test --config flush.messages=12345
> Option combination "[bootstrap-server],[config]" can't be used with option "[alter]"}}
>
> {{$ kafka-topics.sh
> ...
> --config    A topic configuration override for the topic being created or altered...It is supported only in combination with --create if --bootstrap-server option is used.}}
>
> Rather than simply saying that what you want to do isn't available, it would
> be better to say also that you can do it with the kafka-configs command.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-10451) system tests send large command over ssh instead of using remote file for security config
Ron Dagostino created KAFKA-10451: - Summary: system tests send large command over ssh instead of using remote file for security config Key: KAFKA-10451 URL: https://issues.apache.org/jira/browse/KAFKA-10451 Project: Kafka Issue Type: Improvement Components: system tests Reporter: Ron Dagostino

In `kafka.py` the pattern used to supply security configuration information to remote CLI tools is to send the information as part of the ssh command. For example, see this --command-config definition:

{noformat}
Running ssh command: export KAFKA_OPTS="-Djava.security.auth.login.config=/mnt/security/admin_client_as_broker_jaas.conf -Djava.security.krb5.conf=/mnt/security/krb5.conf"; /opt/kafka-dev/bin/kafka-configs.sh --bootstrap-server worker2:9095 --command-config <(echo '
ssl.endpoint.identification.algorithm=HTTPS
sasl.kerberos.service.name=kafka
security.protocol=SASL_SSL
ssl.keystore.location=/mnt/security/test.keystore.jks
ssl.truststore.location=/mnt/security/test.truststore.jks
ssl.keystore.password=test-ks-passwd
sasl.mechanism=SCRAM-SHA-256
ssl.truststore.password=test-ts-passwd
ssl.key.password=test-ks-passwd
sasl.mechanism.inter.broker.protocol=GSSAPI
') --entity-name kafka-client --entity-type users --alter --add-config SCRAM-SHA-256=[password=client-secret]
{noformat}

This ssh command length is getting pretty big. It would be best if this referred to a file as opposed to sending in the file contents as part of the ssh command. This happens in a few places in `kafka.py` and should be rectified.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-10592) system tests not running after python3 merge
Ron Dagostino created KAFKA-10592: - Summary: system tests not running after python3 merge Key: KAFKA-10592 URL: https://issues.apache.org/jira/browse/KAFKA-10592 Project: Kafka Issue Type: Task Components: system tests Reporter: Ron Dagostino Assignee: Nikolay Izhikov

We are seeing these errors on system tests due to the python3 merge:

[ERROR:2020-10-08 21:03:51,341]: Failed to import kafkatest.sanity_checks.test_performance_services, which may indicate a broken test that cannot be loaded: ImportError: No module named server
[ERROR:2020-10-08 21:03:51,351]: Failed to import kafkatest.benchmarks.core.benchmark_test, which may indicate a broken test that cannot be loaded: ImportError: No module named server
[ERROR:2020-10-08 21:03:51,501]: Failed to import kafkatest.tests.core.throttling_test, which may indicate a broken test that cannot be loaded: ImportError: No module named server
[ERROR:2020-10-08 21:03:51,598]: Failed to import kafkatest.tests.client.quota_test, which may indicate a broken test that cannot be loaded: ImportError: No module named server

I ran one of the system tests at the commit prior to the python3 merge (https://github.com/apache/kafka/commit/40a23cc0c2e1efa8632f59b093672221a3c03c36) and it ran fine: http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2020-10-09--001.1602255415--rondagostino--rtd_just_before_python3_merge--40a23cc0c/report.html

I ran the exact same test file at the next commit -- the python3 commit at https://github.com/apache/kafka/commit/4e65030e055104a7526e85b563a11890c61d6ddf -- and it failed with the import error. The test results show no report.html file because nothing ran: http://testing.confluent.io/confluent-kafka-system-test-results/?prefix=2020-10-09--001.1602251990--apache--trunk--7947c18b5/

Not sure when this began, because I do see these tests running successfully during the development process as documented in https://issues.apache.org/jira/browse/KAFKA-10402 (`tests run:684` as recently as 9/20 in that ticket). But the PR build (rebased onto latest trunk) showed the above import errors and only 606 tests run. I assume those 4 files mentioned include 78 tests.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-10418) Unclear deprecation of altering topic configs via kafka-topics with --bootstrap-server
Ron Dagostino created KAFKA-10418: - Summary: Unclear deprecation of altering topic configs via kafka-topics with --bootstrap-server Key: KAFKA-10418 URL: https://issues.apache.org/jira/browse/KAFKA-10418 Project: Kafka Issue Type: Improvement Reporter: Ron Dagostino Assignee: Ron Dagostino Fix For: 2.7.0

Changing a topic config with the kafka-topics command while connecting to Kafka via --bootstrap-server (rather than connecting to ZooKeeper via --zookeeper) has been deprecated. The desired functionality is available elsewhere: it is possible to change a topic config while connecting to Kafka rather than ZooKeeper via the kafka-configs command instead. However, neither the kafka-topics error message received nor the kafka-topics help information itself indicates this other possibility. For example:

{{$ kafka-topics.sh --bootstrap-server localhost:9092 --alter --topic test --config flush.messages=12345
Option combination "[bootstrap-server],[config]" can't be used with option "[alter]"}}

{{$ kafka-topics.sh
...
--config    A topic configuration override for the topic being created or altered...It is supported only in combination with --create if --bootstrap-server option is used.}}

Rather than simply saying that what you want to do isn't available, it would be better to say also that you can do it with the kafka-configs command.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-10443) Consider providing standard set of users in system tests
Ron Dagostino created KAFKA-10443: - Summary: Consider providing standard set of users in system tests Key: KAFKA-10443 URL: https://issues.apache.org/jira/browse/KAFKA-10443 Project: Kafka Issue Type: Test Components: system tests Reporter: Ron Dagostino As part of the KIP-554 implementation we decided to exercise the AdminClient interface for creating SCRAM credentials within the system tests. So instead of bootstrapping both the broker and the user credentials via ZooKeeper (`kafka-configs.sh --alter --zookeeper`) before the broker starts, we bootstrapped just the broker credential via ZooKeeper and then we started the brokers and created the user credential afterwards via the AdminClient (`kafka-configs.sh --alter --bootstrap-server`). We did this by configuring the admin client to log in as the broker. This works fine, but it feels like we should have a separate "admin" user available to do this rather than having to authenticate the admin client as the broker. Furthermore, this feels like it might be a good pattern to consider everywhere -- whenever we create a broker user we should also create an admin user for tests that want/need to leverage it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-10556) NPE if sasl.mechanism is unrecognized
Ron Dagostino created KAFKA-10556: - Summary: NPE if sasl.mechanism is unrecognized Key: KAFKA-10556 URL: https://issues.apache.org/jira/browse/KAFKA-10556 Project: Kafka Issue Type: Task Reporter: Ron Dagostino Assignee: Ron Dagostino

If a client sets an unknown sasl.mechanism value (e.g. mistakenly setting "PLAN" instead of "PLAIN") the client sees a NullPointerException that only indirectly indicates the nature of the problem. For example:

java.lang.NullPointerException
	at org.apache.kafka.common.security.authenticator.SaslClientAuthenticator.sendSaslClientToken(SaslClientAuthenticator.java:430)

It is better to see an exception that directly states what the issue is. For example, the initial version of this PR would provide the following information:

Caused by: org.apache.kafka.common.errors.SaslAuthenticationException: Failed to create SaslClient with mechanism PLAN

-- This message was sent by Atlassian Jira (v8.3.4#803005)
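A minimal sketch of the improvement described (the surrounding variables -- mechanism, authorizationId, protocol, serverName, props, callbackHandler -- are assumed from context): {{Sasl.createSaslClient()}} returns null when no provider supports the requested mechanism, so check for that and fail with a descriptive exception.

{code:java}
SaslClient saslClient = Sasl.createSaslClient(new String[]{mechanism}, authorizationId,
        protocol, serverName, props, callbackHandler);
if (saslClient == null)
    throw new SaslAuthenticationException("Failed to create SaslClient with mechanism " + mechanism);
{code}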
[jira] [Created] (KAFKA-10076) Doc prefixed vs. non-prefix configs
Ron Dagostino created KAFKA-10076: - Summary: Doc prefixed vs. non-prefix configs Key: KAFKA-10076 URL: https://issues.apache.org/jira/browse/KAFKA-10076 Project: Kafka Issue Type: Improvement Components: docs Reporter: Ron Dagostino

Listener-prefixed configs have higher precedence than unprefixed configs. For example, {{listener.name.default.sasl.enabled.mechanisms}} has higher precedence than (and overrides any value of) {{sasl.enabled.mechanisms}}. The docs for {{sasl.enabled.mechanisms}} say nothing about this, and while it may be mentioned elsewhere (I did not do an exhaustive search, so I don't really know), it feels like this could be better documented. In particular, I think there could be two general changes that would be useful:

# If a particular config can be overridden by a prefixed version, then the specific documentation for that config could explicitly state this (e.g. adding something to the {{sasl.enabled.mechanisms}} documentation is just one example).
# Add a general paragraph somewhere that describes the concept of prefixed configs and how they work/what their precedence is relative to unprefixed configs, and (maybe?) the list of configs that can be prefixed. (Again, I didn't do an exhaustive search for this, so it might already exist.)

-- This message was sent by Atlassian Jira (v8.3.4#803005)
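To illustrate the precedence being described (a sketch; "default" is just an example listener name, as in the description above):

{noformat}
sasl.enabled.mechanisms=GSSAPI
# the prefixed config below overrides the unprefixed one for that listener:
listener.name.default.sasl.enabled.mechanisms=PLAIN
{noformat}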
[jira] [Created] (KAFKA-10259) KIP-554: Add Broker-side SCRAM Config API
Ron Dagostino created KAFKA-10259: - Summary: KIP-554: Add Broker-side SCRAM Config API Key: KAFKA-10259 URL: https://issues.apache.org/jira/browse/KAFKA-10259 Project: Kafka Issue Type: New Feature Reporter: Ron Dagostino Assignee: Ron Dagostino -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-10180) TLSv1.3 system tests should not run under Java 8
Ron Dagostino created KAFKA-10180: - Summary: TLSv1.3 system tests should not run under Java 8 Key: KAFKA-10180 URL: https://issues.apache.org/jira/browse/KAFKA-10180 Project: Kafka Issue Type: Bug Components: system tests Affects Versions: 2.6.0 Reporter: Ron Dagostino 18 system tests related to TLSv1.3 are running, and failing, under Java 8. These system tests should not run except when Java 11 or later is in use. http://testing.confluent.io/confluent-kafka-system-test-results/?prefix=2020-06-16--001.1592310680--confluentinc--master--d07ee594d/ (e.g. http://testing.confluent.io/confluent-kafka-system-test-results/?prefix=2020-06-16--001.1592310680--confluentinc--master--d07ee594d/Benchmark/test_end_to_end_latency/interbroker_security_protocol=PLAINTEXT.tls_version=TLSv1.3.security_protocol=SSL.compression_type=snappy/) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12799) Extend TestSecurityRollingUpgrade system test to KRaft
Ron Dagostino created KAFKA-12799: - Summary: Extend TestSecurityRollingUpgrade system test to KRaft Key: KAFKA-12799 URL: https://issues.apache.org/jira/browse/KAFKA-12799 Project: Kafka Issue Type: Test Components: system tests Reporter: Ron Dagostino The TestSecurityRollingUpgrade system test rolls Kafka brokers multiple times to adjust listeners/inter-broker listeners while confirming that producers and consumers continue to work throughout the multiple rolls. We need to extend this test (or write a new one) to do something similar for the KRaft controllers. Producers/consumers are perhaps less appropriate in such a case -- maybe we need to create a topic after each roll to make sure the metalog is being consumed correctly? Note that this will require some logic in `KafkaService.security_config()` because we cache the security config and we will have to mutate it to get the changes to occur. See https://github.com/apache/kafka/pull/10694/ for what we had to do for Kafka broker changes; something similar will have to happen for controller changes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12897) KRaft Controller cannot create topic with more partitions than racks
Ron Dagostino created KAFKA-12897: - Summary: KRaft Controller cannot create topic with more partitions than racks Key: KAFKA-12897 URL: https://issues.apache.org/jira/browse/KAFKA-12897 Project: Kafka Issue Type: Bug Components: controller Affects Versions: 3.0.0 Reporter: Ron Dagostino Assignee: Ron Dagostino Fix For: 3.0.0 https://github.com/apache/kafka/pull/10494 introduced a bug in the KRaft controller where the controller will loop forever in `StripedReplicaPlacer` trying to identify the racks on which to place partitions if the number of requested partitions in a CREATE_TOPICS request exceeds the number of effective racks ("effective" meaning a single rack if none are specified). -- This message was sent by Atlassian Jira (v8.3.4#803005)
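A hypothetical reproduction (topic name and partition count are illustrative): on a KRaft cluster whose brokers define a single rack (or no rack at all), a request like the following asks for more partitions than there are effective racks and would hang the controller in StripedReplicaPlacer.

{noformat}
bin/kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic demo --partitions 3 --replication-factor 1
{noformat}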
[jira] [Created] (KAFKA-13069) Add magic number to DefaultKafkaPrincipalBuilder.KafkaPrincipalSerde
Ron Dagostino created KAFKA-13069: - Summary: Add magic number to DefaultKafkaPrincipalBuilder.KafkaPrincipalSerde Key: KAFKA-13069 URL: https://issues.apache.org/jira/browse/KAFKA-13069 Project: Kafka Issue Type: Bug Affects Versions: 2.8.0, 3.0.0 Reporter: Ron Dagostino Assignee: Ron Dagostino Fix For: 3.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12318) system tests need to fetch Topic IDs via Admin Client instead of via ZooKeeper
Ron Dagostino created KAFKA-12318: - Summary: system tests need to fetch Topic IDs via Admin Client instead of via ZooKeeper Key: KAFKA-12318 URL: https://issues.apache.org/jira/browse/KAFKA-12318 Project: Kafka Issue Type: Task Components: system tests Affects Versions: 3.0.0, 2.8.0 Reporter: Ron Dagostino https://github.com/apache/kafka/commit/86b9fdef2b9e6ef3429313afbaa18487d6e2906e#diff-2b222ad67f56a2876410aba3eeecd78e8b26217192dde72a035c399dc4d3988bR1033-R1052 introduced a topic_id() function in the system tests. This function is currently coded to talk directly to ZooKeeper. This will be a problem when running a Raft-based metadata quorum -- ZooKeeper won't be available. This method needs to be rewritten to leverage the Admin Client. This does not have to be fixed in 2.8 -- the method is only used in upgrade/downgrade-related system tests, and those system tests aren't being performed for Raft-based metadata quorums in the release (Raft-based metadata quorums will only be alpha/preview functionality at that point with upgrades/downgrades unsupported). But it probably will have to be fixed for the next release after that. -- This message was sent by Atlassian Jira (v8.3.4#803005)
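A sketch of what the Admin-client-based lookup could look like on the Java side ({{TopicDescription.topicId()}} is available as part of the topic IDs work; the helper class here is hypothetical):

{code:java}
import java.util.Collections;
import java.util.Map;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.Uuid;

public class TopicIdLookup {
    // Fetch a topic's ID via the Admin client instead of reading ZooKeeper
    public static Uuid topicId(String bootstrapServers, String topic) throws Exception {
        Map<String, Object> conf = Collections.singletonMap(
                AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        try (Admin admin = Admin.create(conf)) {
            TopicDescription description =
                    admin.describeTopics(Collections.singleton(topic)).all().get().get(topic);
            return description.topicId();
        }
    }
}
{code}

The system tests themselves are Python, so the actual fix would presumably shell out to a CLI or use an equivalent mechanism; the sketch just shows the Admin API involved.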
[jira] [Created] (KAFKA-12455) OffsetValidationTest.test_broker_rolling_bounce failing for Raft quorums
Ron Dagostino created KAFKA-12455: - Summary: OffsetValidationTest.test_broker_rolling_bounce failing for Raft quorums Key: KAFKA-12455 URL: https://issues.apache.org/jira/browse/KAFKA-12455 Project: Kafka Issue Type: Bug Affects Versions: 2.8.0 Reporter: Ron Dagostino Assignee: Ron Dagostino OffsetValidationTest.test_broker_rolling_bounce in `consumer_test.py` is failing because the consumer group is rebalancing unexpectedly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12480) Reuse bootstrap servers in clients when last alive broker in cluster metadata is unavailable
Ron Dagostino created KAFKA-12480: - Summary: Reuse bootstrap servers in clients when last alive broker in cluster metadata is unavailable Key: KAFKA-12480 URL: https://issues.apache.org/jira/browse/KAFKA-12480 Project: Kafka Issue Type: Improvement Components: clients Reporter: Ron Dagostino https://issues.apache.org/jira/browse/KAFKA-12455 documented how a Java client can temporarily lose connectivity to a 2-broker cluster that is undergoing a roll because the client will repeatedly retry connecting to the last alive broker that it knows about in the cluster metadata even when that broker is unavailable. The client could potentially fall back to its bootstrap brokers in this case and reconnect to the cluster more quickly. For example, assume a 2-broker cluster has broker IDs 1 and 2 and both appear in the bootstrap servers for a consumer. Assume broker 1 rolls such that the Java consumer receives a new METADATA response and only knows about broker 2 being alive, and then broker 2 rolls before the consumer gets a new METADATA response indicating that broker 1 is also alive. At this point the Java consumer will keep retrying broker 2, and it will not reconnect to the cluster unless/until broker 2 becomes available -- or the client itself is restarted so it can use its bootstrap servers again. Another possibility is to fall back to the full bootstrap servers list when the last alive broker becomes unavailable. I believe librdkafka-based clients may perform this fallback, though I am not certain. We should consider it for Java clients. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12505) Should kafka-storage.sh accept a non-UUID for its --cluster-id parameter?
Ron Dagostino created KAFKA-12505: - Summary: Should kafka-storage.sh accept a non-UUID for its --cluster-id parameter? Key: KAFKA-12505 URL: https://issues.apache.org/jira/browse/KAFKA-12505 Project: Kafka Issue Type: New Feature Reporter: Ron Dagostino Should StorageTool support accepting non-UUIDs via its --cluster-id argument? One purpose of the tool is to minimize the chance that a broker could use data from the wrong volume (i.e. data from another cluster). Generating a random UUID via the --random-uuid parameter encourages using a globally unique value for every cluster and is consistent with the behavior today with ZooKeeper, whereas allowing a non-UUID here would increase the chance that someone could reuse a Cluster ID value across clusters and short-circuit the risk mitigation that this tool provides. Discuss... -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12488) Be more specific about enabled SASL mechanisms in system tests
Ron Dagostino created KAFKA-12488: - Summary: Be more specific about enabled SASL mechanisms in system tests Key: KAFKA-12488 URL: https://issues.apache.org/jira/browse/KAFKA-12488 Project: Kafka Issue Type: Improvement Components: system tests Reporter: Ron Dagostino The `SecurityConfig.enabled_sasl_mechanisms()` method simply returns all SASL mechanisms that are enabled for the test -- whether for brokers, clients, controllers, or Zookeeper. These enabled mechanisms are used in JAAS config files to determine what appears in those config files. For example, the entire list of enabled mechanisms is used in both KafkaClient{} and KafkaServer{} sections, but that's way too broad. We should be more precise about what mechanisms we are interested in for the different sections of these JAAS config files. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12374) Add missing config sasl.mechanism.controller.protocol
Ron Dagostino created KAFKA-12374: - Summary: Add missing config sasl.mechanism.controller.protocol Key: KAFKA-12374 URL: https://issues.apache.org/jira/browse/KAFKA-12374 Project: Kafka Issue Type: Bug Components: config Affects Versions: 2.8.0 Reporter: Ron Dagostino Assignee: Ron Dagostino Fix For: 2.8.0 The config `sasl.mechanism.controller.protocol` from KIP-631 is not implemented. Furthermore, `KafkaRaftManager` is using inter-broker security information when it connects to the Raft controller quorum. KafkaRaftClient should use the first entry in `controller.listener.names` to determine the listener name; that listener name's mapped value in the `listener.security.protocol.map` (if such a mapping exists, otherwise the listener name itself) for the security protocol; and the value of `sasl.mechanism.controller.protocol` for the SASL mechanism. Finally, `RaftControllerNodeProvider` needs to use the value of `sasl.mechanism.controller.protocol` instead of the inter-broker sasl mechanism (it currently determines the listener name and security protocol correctly) -- This message was sent by Atlassian Jira (v8.3.4#803005)
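For reference, a sketch of how these settings relate (the values are illustrative):

{noformat}
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:SASL_PLAINTEXT,PLAINTEXT:PLAINTEXT
# SASL mechanism used when talking to the controller quorum (KIP-631):
sasl.mechanism.controller.protocol=PLAIN
{noformat}

With these values, the node takes the first entry of controller.listener.names (CONTROLLER), maps it through listener.security.protocol.map to SASL_PLAINTEXT for the security protocol, and uses PLAIN as the SASL mechanism.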
[jira] [Created] (KAFKA-12402) client_sasl_mechanism should be an explicit list instead of a .csv string
Ron Dagostino created KAFKA-12402: - Summary: client_sasl_mechanism should be an explicit list instead of a .csv string Key: KAFKA-12402 URL: https://issues.apache.org/jira/browse/KAFKA-12402 Project: Kafka Issue Type: Improvement Components: system tests Reporter: Ron Dagostino The SecurityConfig and the KafkaService system test classes both accept a client_sasl_mechanism parameter. This is typically a single value (e.g. PLAIN), but DelegationTokenTest sets self.kafka.client_sasl_mechanism = 'GSSAPI,SCRAM-SHA-256'. If we need to support a list of mechanisms then the parameter should be an explicit list instead of a .csv string. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12348) The metadata module currently uses Yammer metrics. Should it use Kafka metrics instead?
Ron Dagostino created KAFKA-12348: - Summary: The metadata module currently uses Yammer metrics. Should it use Kafka metrics instead? Key: KAFKA-12348 URL: https://issues.apache.org/jira/browse/KAFKA-12348 Project: Kafka Issue Type: Task Components: metrics Affects Versions: 3.0.0, 2.8.0 Reporter: Ron Dagostino -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-13219) BrokerState metric not working for KRaft clusters
Ron Dagostino created KAFKA-13219: - Summary: BrokerState metric not working for KRaft clusters Key: KAFKA-13219 URL: https://issues.apache.org/jira/browse/KAFKA-13219 Project: Kafka Issue Type: Bug Components: kraft Affects Versions: 3.0.0 Reporter: Ron Dagostino Assignee: Ron Dagostino The BrokerState metric always has a value of 0, for NOT_RUNNING, in KRaft clusters -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-13224) broker.id does not appear in config's originals map when setting just node.id
Ron Dagostino created KAFKA-13224: - Summary: broker.id does not appear in config's originals map when setting just node.id Key: KAFKA-13224 URL: https://issues.apache.org/jira/browse/KAFKA-13224 Project: Kafka Issue Type: Bug Affects Versions: 3.0.0 Reporter: Ron Dagostino Assignee: Ron Dagostino

Plugins may expect broker.id to exist as a key in the config's various originals()-related maps, but with KRaft we rely solely on node.id for the broker's ID, and with the Zk-based brokers we provide the option to specify node.id in addition to (or as a full replacement for) broker.id. There are multiple problems related to this switch to node.id:

# We do not enforce consistency between explicitly-specified broker.id and node.id properties in the config -- it is entirely possible right now that we could set broker.id=0 and also set node.id=1, and the broker will use 1 for its ID. This is confusing at best; the broker should detect this inconsistency and fail to start with a ConfigException.
# When node.id is set, both that value and any explicitly-set broker.id value will exist in the config's *originals()-related maps*. Downstream components are often configured based on these maps, and they may ask for the broker.id, so downstream components may be misconfigured if the values differ, or they may fail during configuration if no broker.id key exists in the map at all.
# The config's *values()-related maps* will contain either the explicitly-specified broker.id value or the default value of -1. When node.id is set, both that value (which cannot be negative) and the (potentially -1) broker.id value will exist in the config's values()-related maps. Downstream components are often configured based on these maps, and they may ask for the broker.id, so downstream components may be misconfigured if the broker.id value differs from the broker's true ID.

The broker should detect inconsistency between explicitly-specified broker.id and node.id values and fail startup accordingly. It should also ensure that the config's originals()- and values()-related maps contain the same mapped values for both broker.id and node.id keys.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
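A minimal sketch of the startup check called for in item 1 (the method name and exact message are illustrative):

{code:java}
import org.apache.kafka.common.config.ConfigException;

static void validateBrokerAndNodeId(Integer brokerId, Integer nodeId) {
    // both explicitly set but inconsistent: fail fast rather than silently
    // preferring node.id
    if (brokerId != null && nodeId != null && !brokerId.equals(nodeId))
        throw new ConfigException(String.format(
                "broker.id (%d) and node.id (%d) must be the same if both are set",
                brokerId, nodeId));
}
{code}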
[jira] [Created] (KAFKA-13270) Kafka may fail to connect to ZooKeeper, retry forever, and never start
Ron Dagostino created KAFKA-13270: - Summary: Kafka may fail to connect to ZooKeeper, retry forever, and never start Key: KAFKA-13270 URL: https://issues.apache.org/jira/browse/KAFKA-13270 Project: Kafka Issue Type: Bug Affects Versions: 3.0.0 Reporter: Ron Dagostino Assignee: Ron Dagostino Fix For: 3.0.0 The implementation of https://issues.apache.org/jira/browse/ZOOKEEPER-3593 in ZooKeeper version 3.6.0 decreased the default value for the ZooKeeper client's `jute.maxbuffer` configuration from 4MB to 1MB. This can cause a problem if Kafka tries to retrieve a large amount of data across many znodes -- in such a case the ZooKeeper client will repeatedly emit a message of the form "java.io.IOException: Packet len <> is out of range" and the Kafka broker will never connect to ZooKeeper and will fail to make progress on the startup sequence. We can avoid the potential for this issue to occur by explicitly setting the value to 4MB whenever we create a new ZooKeeper client, as long as no explicit value has been set via the `jute.maxbuffer` system property. -- This message was sent by Atlassian Jira (v8.3.4#803005)
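A minimal sketch of the described mitigation (the placement of the check is illustrative):

{code:java}
// default jute.maxbuffer back to 4 MB before creating the ZooKeeper client,
// unless the user has explicitly set the system property themselves
if (System.getProperty("jute.maxbuffer") == null)
    System.setProperty("jute.maxbuffer", Integer.toString(4 * 1024 * 1024));
{code}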
[jira] [Created] (KAFKA-13192) broker.id and node.id can be specified inconsistently
Ron Dagostino created KAFKA-13192: - Summary: broker.id and node.id can be specified inconsistently Key: KAFKA-13192 URL: https://issues.apache.org/jira/browse/KAFKA-13192 Project: Kafka Issue Type: Bug Affects Versions: 3.0.0 Reporter: Ron Dagostino If both broker.id and node.id are set, and they are set inconsistently (e.g. broker.id=0, node.id=1), then the value of node.id is used and the broker.id value is left at the original value. The server should detect this inconsistency, throw a ConfigException, and fail to start. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-13456) controller.listener.names is required for all KRaft nodes, not just controllers
Ron Dagostino created KAFKA-13456: - Summary: controller.listener.names is required for all KRaft nodes, not just controllers Key: KAFKA-13456 URL: https://issues.apache.org/jira/browse/KAFKA-13456 Project: Kafka Issue Type: Bug Affects Versions: 3.0.0, 2.8.0, 3.1.0 Reporter: Ron Dagostino Assignee: Ron Dagostino The controller.listener.names config is currently checked for existence when process.roles contains the controller role (i.e. process.roles=controller or process.roles=broker,controller); it is not checked for existence when process.roles=broker. However, KRaft brokers have to talk to KRaft controllers, of course, and they do so by taking the first entry in the controller.listener.names list. Therefore, controller.listener.names is required in KRaft mode even when process.roles=broker. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (KAFKA-13552) Unable to dynamically change broker log levels on KRaft
Ron Dagostino created KAFKA-13552: - Summary: Unable to dynamically change broker log levels on KRaft Key: KAFKA-13552 URL: https://issues.apache.org/jira/browse/KAFKA-13552 Project: Kafka Issue Type: Bug Components: kraft Affects Versions: 3.0.0, 3.1.0 Reporter: Ron Dagostino

It is currently not possible to dynamically change the log level in KRaft. For example:

kafka-configs.sh --bootstrap-server <host:port> --alter --add-config "kafka.server.ReplicaManager=DEBUG" --entity-type broker-loggers --entity-name 0

Results in:

org.apache.kafka.common.errors.InvalidRequestException: Unexpected resource type BROKER_LOGGER.

The code to process this request is in ZkAdminManager.alterLogLevelConfigs(). This needs to be moved out of there, and the functionality has to be processed locally on the broker instead of being forwarded to the KRaft controller.

It is also an open question as to how we can dynamically alter log levels for a remote KRaft controller. Connecting directly to it is one possible solution, but that may not be desirable since generally connecting directly to the controller is not necessary.

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (KAFKA-13069) Add magic number to DefaultKafkaPrincipalBuilder.KafkaPrincipalSerde
[ https://issues.apache.org/jira/browse/KAFKA-13069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ron Dagostino resolved KAFKA-13069.
-----------------------------------
    Resolution: Invalid

Flexible fields are sufficient as per KIP-590 VOTE email thread, so a magic number will not be needed.

> Add magic number to DefaultKafkaPrincipalBuilder.KafkaPrincipalSerde
> ---------------------------------------------------------------------
>
>                 Key: KAFKA-13069
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13069
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.0.0, 2.8.0
>            Reporter: Ron Dagostino
>            Assignee: Ron Dagostino
>            Priority: Critical
>             Fix For: 3.1.0
>

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-13140) KRaft brokers do not expose kafka.controller metrics, breaking backwards compatibility
Ron Dagostino created KAFKA-13140: - Summary: KRaft brokers do not expose kafka.controller metrics, breaking backwards compatibility Key: KAFKA-13140 URL: https://issues.apache.org/jira/browse/KAFKA-13140 Project: Kafka Issue Type: Bug Components: kraft Affects Versions: 2.8.0, 3.0.0 Reporter: Ron Dagostino Assignee: Ron Dagostino Fix For: 3.1.0 The following controller metrics are exposed on every broker in a ZooKeeper-based (i.e. non-KRaft) cluster regardless of whether the broker is the active controller or not, but these metrics are not exposed on KRaft nodes that have process.roles=broker (i.e. KRaft nodes that do not implement the controller role). For backwards compatibility, KRaft nodes that are just brokers should expose these metrics with values all equal to 0, just like ZooKeeper-based brokers do when they are not the active controller. kafka.controller:type=KafkaController,name=ActiveControllerCount kafka.controller:type=KafkaController,name=GlobalTopicCount kafka.controller:type=KafkaController,name=GlobalPartitionCount kafka.controller:type=KafkaController,name=OfflinePartitionsCount kafka.controller:type=KafkaController,name=PreferredReplicaImbalanceCount -- This message was sent by Atlassian Jira (v8.3.4#803005)
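As a sketch of the requested behavior, a broker-role-only node could register constant zero gauges under the legacy names (illustrative only; Kafka's actual metrics code goes through KafkaMetricsGroup):

{code:java}
import com.yammer.metrics.Metrics;
import com.yammer.metrics.core.Gauge;
import com.yammer.metrics.core.MetricName;

public class ZeroControllerMetrics {
    // Register a gauge that always reports 0, mirroring a non-active ZK controller.
    static void registerZeroGauge(String name) {
        MetricName metricName = new MetricName("kafka.controller", "KafkaController", name);
        Metrics.defaultRegistry().newGauge(metricName, new Gauge<Integer>() {
            @Override
            public Integer value() {
                return 0;
            }
        });
    }

    public static void main(String[] args) {
        for (String name : new String[] {"ActiveControllerCount", "GlobalTopicCount",
                "GlobalPartitionCount", "OfflinePartitionsCount", "PreferredReplicaImbalanceCount"}) {
            registerZeroGauge(name);
        }
    }
}
{code}

Note that the three-argument MetricName constructor derives a quoted MBean name (that is the subject of KAFKA-13137 below), so a real fix would pass an explicit MBean name.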
[jira] [Created] (KAFKA-13137) KRaft Controller Metric MBean names are incorrectly quoted
Ron Dagostino created KAFKA-13137: - Summary: KRaft Controller Metric MBean names are incorrectly quoted Key: KAFKA-13137 URL: https://issues.apache.org/jira/browse/KAFKA-13137 Project: Kafka Issue Type: Bug Components: controller Affects Versions: 2.8.0, 3.0.0 Reporter: Ron Dagostino Assignee: Ron Dagostino Fix For: 3.0.0 QuorumControllerMetrics is letting com.yammer.metrics.MetricName create the MBean names for all of the controller metrics, and that adds quotes. We have typically used KafkaMetricsGroup to explicitly create the MBean name, and we do not add quotes there. The controller metric names that are in common between the old and new controller must remain the same, but they are not. For example, this non-KRaft MBean name: kafka.controller:type=KafkaController,name=OfflinePartitionsCount has morphed into this when using KRaft: "kafka.controller":type="KafkaController",name="OfflinePartitionsCount" -- This message was sent by Atlassian Jira (v8.3.4#803005)
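The quoting difference can be demonstrated directly with the yammer MetricName class (a small standalone sketch):

{code:java}
import com.yammer.metrics.core.MetricName;

public class MBeanNameQuoting {
    public static void main(String[] args) {
        // Letting yammer derive the MBean name quotes every component:
        MetricName derived = new MetricName("kafka.controller", "KafkaController", "OfflinePartitionsCount");
        System.out.println(derived.getMBeanName());
        // "kafka.controller":type="KafkaController",name="OfflinePartitionsCount"

        // Passing an explicit MBean name, as KafkaMetricsGroup effectively does, avoids the quotes:
        MetricName explicit = new MetricName("kafka.controller", "KafkaController",
            "OfflinePartitionsCount", null,
            "kafka.controller:type=KafkaController,name=OfflinePartitionsCount");
        System.out.println(explicit.getMBeanName());
        // kafka.controller:type=KafkaController,name=OfflinePartitionsCount
    }
}
{code}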
[jira] [Created] (KAFKA-15495) KRaft partition truncated when the only ISR member restarts with an empty disk
Ron Dagostino created KAFKA-15495: - Summary: KRaft partition truncated when the only ISR member restarts with an empty disk Key: KAFKA-15495 URL: https://issues.apache.org/jira/browse/KAFKA-15495 Project: Kafka Issue Type: Bug Affects Versions: 3.5.1, 3.4.1, 3.3.2, 3.6.0 Reporter: Ron Dagostino
Assume a topic-partition in KRaft has just a single leader replica in the ISR. Assume next that this replica goes offline. This replica's log will define the contents of that partition when the replica restarts, which is correct behavior. However, assume now that the replica has a disk failure, and we then replace the failed disk with a new, empty disk that we also format with the storage tool so it has the correct cluster ID. If we then restart the broker, the topic-partition will have no data in it, and any other replicas that might exist will truncate their logs to match. See below for a step-by-step demo of how to reproduce this.
[KIP-858: Handle JBOD broker disk failure in KRaft|https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft] introduces the concept of a Disk UUID that we can use to solve this problem. Specifically, when the leader restarts with an empty (but correctly-formatted) disk, the actual UUID associated with the disk will be different. The controller will notice upon broker re-registration that its disk UUID differs from what was previously registered. Right now we have no way of detecting this situation, but the disk UUID gives us that capability.
STEPS TO REPRODUCE:
Create a single-broker cluster with a single controller. The standard files under config/kraft work well:
bin/kafka-storage.sh random-uuid
J8qXRwI-Qyi2G0guFTiuYw
# ensure we start clean
/bin/rm -rf /tmp/kraft-broker-logs /tmp/kraft-controller-logs
bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/controller.properties
bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker.properties
bin/kafka-server-start.sh config/kraft/controller.properties
bin/kafka-server-start.sh config/kraft/broker.properties
bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo1 --partitions 1 --replication-factor 1
# create the __consumer_offsets topic
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic foo1 --from-beginning
^C
# confirm that the __consumer_offsets topic partitions are all created and on the broker with node id 2
bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe
Now create 2 more brokers, with node IDs 11 and 12:
cat config/kraft/broker.properties | sed 's/node.id=2/node.id=11/' | sed 's/localhost:9092/localhost:9011/g' | sed 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs11#' > config/kraft/broker11.properties
cat config/kraft/broker.properties | sed 's/node.id=2/node.id=12/' | sed 's/localhost:9092/localhost:9012/g' | sed 's#log.dirs=/tmp/kraft-broker-logs#log.dirs=/tmp/kraft-broker-logs12#' > config/kraft/broker12.properties
# ensure we start clean
/bin/rm -rf /tmp/kraft-broker-logs11 /tmp/kraft-broker-logs12
bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker11.properties
bin/kafka-storage.sh format --cluster-id J8qXRwI-Qyi2G0guFTiuYw --config config/kraft/broker12.properties
bin/kafka-server-start.sh config/kraft/broker11.properties
bin/kafka-server-start.sh config/kraft/broker12.properties
# create a topic with a single partition replicated on two brokers
bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic foo2 --partitions 1 --replication-factor 2
# reassign partitions onto brokers with node IDs 11 and 12
cat > /tmp/reassign.json <
[jira] [Resolved] (KAFKA-15219) Support delegation tokens in KRaft
[ https://issues.apache.org/jira/browse/KAFKA-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino resolved KAFKA-15219. --- Fix Version/s: 3.6.0 Resolution: Fixed > Support delegation tokens in KRaft > -- > > Key: KAFKA-15219 > URL: https://issues.apache.org/jira/browse/KAFKA-15219 > Project: Kafka > Issue Type: Improvement >Affects Versions: 3.6.0 >Reporter: Viktor Somogyi-Vass >Assignee: Proven Provenzano >Priority: Critical > Fix For: 3.6.0 > > > Delegation tokens were created in KIP-48 and improved in KIP-373. KIP-900 > paved the way to supporting them in KRaft by adding SCRAM support, but > delegation tokens still aren't supported in KRaft. > There are multiple issues: > - TokenManager would still try to create tokens in Zookeeper. Instead, we > should forward admin requests to the controller, which would store them in > the metadata similarly to SCRAM. We probably won't need new protocols, just > enveloping similar to other existing controller requests. > - TokenManager should run on Controller nodes only (or in mixed mode). > - Integration tests will need to be adapted as well and parameterized > with Zookeeper/KRaft. > - Documentation needs to be improved to factor in KRaft. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14056) Test reading of old message formats in ZK-to-KRaft upgrade test
Ron Dagostino created KAFKA-14056: - Summary: Test reading of old message formats in ZK-to-KRaft upgrade test Key: KAFKA-14056 URL: https://issues.apache.org/jira/browse/KAFKA-14056 Project: Kafka Issue Type: Task Components: kraft Reporter: Ron Dagostino Whenever we support ZK-to-KRaft upgrade, we must confirm that we can still read messages with an older message format. We can no longer write such messages as of IBP 3.0 (which is the minimum supported with KRaft), but we must still support reading such messages with KRaft. Therefore, the only way to test this would be to write the messages with a non-KRaft cluster, upgrade to KRaft, and then confirm we can read those messages. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14057) Support dynamic reconfiguration in KRaft remote controllers
Ron Dagostino created KAFKA-14057: - Summary: Support dynamic reconfiguration in KRaft remote controllers Key: KAFKA-14057 URL: https://issues.apache.org/jira/browse/KAFKA-14057 Project: Kafka Issue Type: Task Reporter: Ron Dagostino We currently do not support dynamic reconfiguration of KRaft remote controllers. We only wire up brokers to react to metadata log changes; we do no such wiring in a node where process.roles=controller. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14051) KRaft remote controllers do not create metrics reporters
Ron Dagostino created KAFKA-14051: - Summary: KRaft remote controllers do not create metrics reporters Key: KAFKA-14051 URL: https://issues.apache.org/jira/browse/KAFKA-14051 Project: Kafka Issue Type: Bug Components: kraft Affects Versions: 3.3 Reporter: Ron Dagostino KRaft remote controllers (KRaft nodes with the configuration value process.roles=controller) do not create the configured metrics reporters defined by the configuration key metric.reporters. The reason is that KRaft remote controllers are not wired up for dynamic config changes, and the creation of the configured metric reporters actually happens during the wiring up of the broker for dynamic reconfiguration, in the invocation of DynamicBrokerConfig.addReconfigurables(KafkaBroker). -- This message was sent by Atlassian Jira (v8.20.10#820010)
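For context, metric.reporters names implementations of the org.apache.kafka.common.metrics.MetricsReporter interface, like the minimal sketch below; on a process.roles=controller node such a reporter is simply never instantiated, because the addReconfigurables() wiring never runs:

{code:java}
import java.util.List;
import java.util.Map;
import org.apache.kafka.common.metrics.KafkaMetric;
import org.apache.kafka.common.metrics.MetricsReporter;

// Minimal reporter of the kind metric.reporters points at.
public class LoggingMetricsReporter implements MetricsReporter {
    @Override public void configure(Map<String, ?> configs) { }

    @Override public void init(List<KafkaMetric> metrics) {
        System.out.println("Reporter initialized with " + metrics.size() + " metrics");
    }

    @Override public void metricChange(KafkaMetric metric) { }

    @Override public void metricRemoval(KafkaMetric metric) { }

    @Override public void close() { }
}
{code}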
[jira] [Created] (KAFKA-14105) Remove quorum.all_non_upgrade for system tests
Ron Dagostino created KAFKA-14105: - Summary: Remove quorum.all_non_upgrade for system tests Key: KAFKA-14105 URL: https://issues.apache.org/jira/browse/KAFKA-14105 Project: Kafka Issue Type: Task Components: kraft, system tests Reporter: Ron Dagostino We defined `all_non_upgrade = [zk, remote_kraft]` in `quorum.py` to encapsulate the quorum(s) that we want system tests to generally run with when they are unrelated to upgrading. The idea was that we would just annotate tests with that and then we would be able to change the definition of it as we move through and beyond the KRaft bridge release. But it is confusing, and search-and-replace is cheap -- especially if we are only doing it once or twice over the course of the project. So we should eliminate the definition of `quorum.all_non_upgrade` (which was intended to be mutable over the course of the project) in favor of something like `zk_and_remote_kraft`, which will forever list ZK and REMOTE_KRAFT. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14195) Fix KRaft AlterConfig policy usage for Legacy/Full case
Ron Dagostino created KAFKA-14195: - Summary: Fix KRaft AlterConfig policy usage for Legacy/Full case Key: KAFKA-14195 URL: https://issues.apache.org/jira/browse/KAFKA-14195 Project: Kafka Issue Type: Bug Affects Versions: 3.3 Reporter: Ron Dagostino Assignee: Ron Dagostino The fix for https://issues.apache.org/jira/browse/KAFKA-14039 adjusted the invocation of the alter configs policy check in KRaft to match the behavior in ZooKeeper, which is to only provide the configs that were explicitly sent in the request. While the code was correct for the incremental alter configs case, the code actually included the implicit deletions for the legacy/non-incremental alter configs case, and those implicit deletions are not included in the ZooKeeper-based invocation. The implicit deletions should not be passed in the legacy case. -- This message was sent by Atlassian Jira (v8.20.10#820010)
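To see why this matters, consider a hypothetical alter-configs policy: with the bug, the configs() map for a legacy alter-configs request also contains the implicitly deleted keys, so a policy can reject a request over a config the caller never sent:

{code:java}
import java.util.Map;
import org.apache.kafka.common.errors.PolicyViolationException;
import org.apache.kafka.server.policy.AlterConfigPolicy;

// Hypothetical policy illustrating the problem; the key it guards is arbitrary.
public class ExampleAlterConfigPolicy implements AlterConfigPolicy {
    @Override public void configure(Map<String, ?> configs) { }

    @Override public void validate(RequestMetadata requestMetadata) {
        // With the KRaft bug, this map unexpectedly includes implicit deletions
        // for legacy (non-incremental) requests.
        Map<String, String> requested = requestMetadata.configs();
        if (requested.containsKey("unclean.leader.election.enable")) {
            throw new PolicyViolationException("unclean.leader.election.enable may not be altered");
        }
    }

    @Override public void close() { }
}
{code}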
[jira] [Resolved] (KAFKA-14051) KRaft remote controllers do not create metrics reporters
[ https://issues.apache.org/jira/browse/KAFKA-14051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino resolved KAFKA-14051. --- Resolution: Fixed > KRaft remote controllers do not create metrics reporters > > > Key: KAFKA-14051 > URL: https://issues.apache.org/jira/browse/KAFKA-14051 > Project: Kafka > Issue Type: Bug > Components: kraft >Affects Versions: 3.3 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Major > > KRaft remote controllers (KRaft nodes with the configuration value > process.roles=controller) do not create the configured metrics reporters > defined by the configuration key metric.reporters. The reason is that > KRaft remote controllers are not wired up for dynamic config changes, and the > creation of the configured metric reporters actually happens during the > wiring up of the broker for dynamic reconfiguration, in the invocation of > DynamicBrokerConfig.addReconfigurables(KafkaBroker). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-14392) KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms
[ https://issues.apache.org/jira/browse/KAFKA-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino resolved KAFKA-14392. --- Fix Version/s: 3.5.0 Resolution: Fixed > KRaft broker heartbeat timeout should not exceed broker.session.timeout.ms > -- > > Key: KAFKA-14392 > URL: https://issues.apache.org/jira/browse/KAFKA-14392 > Project: Kafka > Issue Type: Improvement > Components: kraft >Affects Versions: 3.3.0, 3.4.0, 3.3.1, 3.3.2 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Minor > Fix For: 3.5.0 > > > KRaft brokers maintain their liveness in the cluster by sending > BROKER_HEARTBEAT requests to the active controller; the active controller > fences a broker if it doesn't receive a heartbeat request from that broker > within the period defined by `broker.session.timeout.ms`. The broker should > use a request timeout for its BROKER_HEARTBEAT requests that is no larger > than the session timeout being used by the controller; a larger timeout > creates the possibility that upon controller failover the broker might not > cancel an existing heartbeat request in time to subsequently heartbeat to the > new controller and maintain an uninterrupted session in the cluster. In other > words, a failure of the active controller could result in under-replicated > (or under-min ISR) partitions simply due to a delay in brokers heartbeating > to the new controller. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14394) BrokerToControllerChannelManager has 2 separate timeouts
Ron Dagostino created KAFKA-14394: - Summary: BrokerToControllerChannelManager has 2 separate timeouts Key: KAFKA-14394 URL: https://issues.apache.org/jira/browse/KAFKA-14394 Project: Kafka Issue Type: Task Reporter: Ron Dagostino BrokerToControllerChannelManager uses `config.controllerSocketTimeoutMs` as its default `networkClientRetryTimeoutMs` in general, but it does accept a second `retryTimeoutMs` value -- and there is exactly one place where the second timeout is used: within BrokerToControllerRequestThread. Is this second, separate timeout actually necessary, or is it a bug (in which case the two timeouts should be the same)? Closely related to this is the case of AlterPartitionManager, which sends Long.MAX_VALUE as the retryTimeoutMs value when it instantiates its instance of BrokerToControllerChannelManager. Is this Long.MAX_VALUE correct, when in fact `config.controllerSocketTimeoutMs` is being used as the other timeout? This is related to https://issues.apache.org/jira/projects/KAFKA/issues/KAFKA-14392 and the associated PR, https://github.com/apache/kafka/pull/12856 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14392) KRaft should comment controller.socket.timeout.ms <= broker.session.timeout.ms
Ron Dagostino created KAFKA-14392: - Summary: KRaft should comment controller.socket.timeout.ms <= broker.session.timeout.ms Key: KAFKA-14392 URL: https://issues.apache.org/jira/browse/KAFKA-14392 Project: Kafka Issue Type: Improvement Reporter: Ron Dagostino Assignee: Ron Dagostino KRaft brokers maintain their liveness in the cluster by sending BROKER_HEARTBEAT requests to the active controller; the active controller fences a broker if it doesn't receive a heartbeat request from that broker within the period defined by `broker.session.timeout.ms`. The broker should use a request timeout for its BROKER_HEARTBEAT requests that is no larger than the session timeout being used by the controller; a larger timeout creates the possibility that upon controller failover the broker might not cancel an existing heartbeat request in time to subsequently heartbeat to the new controller and maintain an uninterrupted session in the cluster. In other words, a failure of the active controller could result in under-replicated (or under-min ISR) partitions simply due to a delay in brokers heartbeating to the new controller. -- This message was sent by Atlassian Jira (v8.20.10#820010)
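A sketch of the intended invariant, using the current defaults as assumed values (9 seconds for broker.session.timeout.ms, 30 seconds for controller.socket.timeout.ms):

{code:java}
public class HeartbeatTimeoutInvariant {
    public static void main(String[] args) {
        int brokerSessionTimeoutMs = 9_000;      // broker.session.timeout.ms default
        int controllerSocketTimeoutMs = 30_000;  // controller.socket.timeout.ms default
        // The heartbeat request timeout should never exceed the session timeout;
        // otherwise a stuck request to a dead controller can outlive the session.
        int heartbeatRequestTimeoutMs = Math.min(controllerSocketTimeoutMs, brokerSessionTimeoutMs);
        System.out.println("heartbeat request timeout: " + heartbeatRequestTimeoutMs + " ms");
    }
}
{code}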
[jira] [Resolved] (KAFKA-14394) BrokerToControllerChannelManager has 2 separate timeouts
[ https://issues.apache.org/jira/browse/KAFKA-14394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino resolved KAFKA-14394. --- Resolution: Not A Problem > BrokerToControllerChannelManager has 2 separate timeouts > > > Key: KAFKA-14394 > URL: https://issues.apache.org/jira/browse/KAFKA-14394 > Project: Kafka > Issue Type: Task >Reporter: Ron Dagostino >Priority: Major > > BrokerToControllerChannelManager uses `config.controllerSocketTimeoutMs` as > its default `networkClientRetryTimeoutMs` in general, but it does accept a > second `retryTimeoutMs` value -- and there is exactly one place where the > second timeout is used: within BrokerToControllerRequestThread. Is this > second, separate timeout actually necessary, or is it a bug (in which case > the two timeouts should be the same)? Closely related to this is the case of > AlterPartitionManager, which sends Long.MAX_VALUE as the retryTimeoutMs value > when it instantiates its instance of BrokerToControllerChannelManager. Is > this Long.MAX_VALUE correct, when in fact `config.controllerSocketTimeoutMs` > is being used as the other timeout? > This is related to > https://issues.apache.org/jira/projects/KAFKA/issues/KAFKA-14392 and the > associated PR, https://github.com/apache/kafka/pull/12856 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14371) quorum-state file contains empty/unused clusterId field
Ron Dagostino created KAFKA-14371: - Summary: quorum-state file contains empty/unused clusterId field Key: KAFKA-14371 URL: https://issues.apache.org/jira/browse/KAFKA-14371 Project: Kafka Issue Type: Improvement Reporter: Ron Dagostino The KRaft controller's quorum-state file `$LOG_DIR/__cluster_metadata-0/quorum-state` contains an empty clusterId value. This value is never non-empty, and it is never used after it is written and then subsequently read. This is a cosmetic issue; it would be best if this value did not exist there. The cluster ID already exists in the `$LOG_DIR/meta.properties` file. -- This message was sent by Atlassian Jira (v8.20.10#820010)
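For reference, the quorum-state file contents look roughly like the following (field values here are illustrative; note the always-empty clusterId):

{code}
{"clusterId":"","leaderId":1,"leaderEpoch":5,"votedId":-1,"appliedOffset":0,"currentVoters":[{"voterId":1}],"data_version":0}
{code}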
[jira] [Resolved] (KAFKA-14351) Implement controller mutation quotas in KRaft
[ https://issues.apache.org/jira/browse/KAFKA-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino resolved KAFKA-14351. --- Fix Version/s: 3.5.0 Resolution: Fixed > Implement controller mutation quotas in KRaft > - > > Key: KAFKA-14351 > URL: https://issues.apache.org/jira/browse/KAFKA-14351 > Project: Kafka > Issue Type: Improvement >Reporter: Colin McCabe >Assignee: Ron Dagostino >Priority: Major > Labels: kip-500 > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14735) Improve KRaft metadata image change performance at high topic counts
Ron Dagostino created KAFKA-14735: - Summary: Improve KRaft metadata image change performance at high topic counts Key: KAFKA-14735 URL: https://issues.apache.org/jira/browse/KAFKA-14735 Project: Kafka Issue Type: Improvement Components: kraft Reporter: Ron Dagostino Assignee: Ron Dagostino Fix For: 3.5.0 Performance of KRaft metadata image changes is currently O(<# of topics in cluster>). This means the amount of time it takes to create just a *single* topic scales linearly with the number of topics in the entire cluster. This impacts both controllers and brokers because both use the metadata image to represent the KRaft metadata log. The performance of these changes should scale with the number of topics being changed -- so creating a single topic should perform similarly regardless of the number of topics in the cluster. -- This message was sent by Atlassian Jira (v8.20.10#820010)
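A toy illustration of the cost difference, assuming a plain Java map for the copy-on-write case and the pcollections library (a structure-sharing persistent map, one plausible fix direction) for the alternative:

{code:java}
import java.util.HashMap;
import java.util.Map;
import org.pcollections.HashTreePMap;
import org.pcollections.PMap;

public class ImageUpdateCost {
    // Copy-on-write: every delta copies the entire topics map, O(#topics in cluster).
    static Map<String, Integer> applyByFullCopy(Map<String, Integer> image, String topic, int partitions) {
        Map<String, Integer> next = new HashMap<>(image);
        next.put(topic, partitions);
        return next;
    }

    // Persistent map: the new image shares structure with the old one,
    // so the cost scales with the size of the change, not the cluster.
    static PMap<String, Integer> applyShared(PMap<String, Integer> image, String topic, int partitions) {
        return image.plus(topic, partitions);
    }

    public static void main(String[] args) {
        PMap<String, Integer> image = HashTreePMap.empty();
        image = applyShared(image, "foo", 3);
        System.out.println(image);
    }
}
{code}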
[jira] [Created] (KAFKA-14731) Upgrade ZooKeeper to 3.6.4
Ron Dagostino created KAFKA-14731: - Summary: Upgrade ZooKeeper to 3.6.4 Key: KAFKA-14731 URL: https://issues.apache.org/jira/browse/KAFKA-14731 Project: Kafka Issue Type: Task Affects Versions: 3.3.2, 3.2.3, 3.4.0, 3.1.2, 3.0.2, 3.5.0 Reporter: Ron Dagostino Assignee: Ron Dagostino Fix For: 3.2.4, 3.1.3, 3.0.3, 3.5.0, 3.4.1, 3.3.3 We have https://issues.apache.org/jira/projects/KAFKA/issues/KAFKA-14661 opened to upgrade ZooKeeper from 3.6.3 to 3.8.1, and that will likely be actioned in time for 3.5.0. But in the meantime, ZooKeeper 3.6.4 has been released, so we should take the patch version bump in trunk now and also apply the bump to the next patch releases of 3.0, 3.1, 3.2, 3.3, and 3.4. Note that KAFKA-14661 should *not* be applied to branches prior to trunk (and presumably 3.5). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-14890) Kafka initiates shutdown due to connectivity problem with Zookeeper and FatalExitError from ChangeNotificationProcessorThread
[ https://issues.apache.org/jira/browse/KAFKA-14890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino resolved KAFKA-14890. --- Resolution: Duplicate Duplicate of https://issues.apache.org/jira/browse/KAFKA-14887 > Kafka initiates shutdown due to connectivity problem with Zookeeper and > FatalExitError from ChangeNotificationProcessorThread > - > > Key: KAFKA-14890 > URL: https://issues.apache.org/jira/browse/KAFKA-14890 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 3.3.2 >Reporter: Denis Razuvaev >Priority: Major > > Hello, > We have faced this deadlock in Kafka several times; a similar issue is > https://issues.apache.org/jira/browse/KAFKA-13544 > The question: is it expected behavior that Kafka decides to shut down due to > connectivity problems with Zookeeper? It seems related to the inability to > read data from the */feature* Zk node and the > _ZooKeeperClientExpiredException_ thrown from the _ZooKeeperClient_ class. This > exception is caught only in the catch block of the _doWork()_ method > in {_}ChangeNotificationProcessorThread{_}, and it leads to a > {_}FatalExitError{_}. > This shutdown problem is reproducible in the new versions of Kafka (which > already have the deadlock fix from KAFKA-13544). > It is hard to write a synthetic test to reproduce the problem, but it can be > reproduced locally in debug mode with the following steps: > 1) Start Zookeeper and start Kafka in debug mode. > 2) Emulate a connectivity problem between Kafka and Zookeeper; for example, the > connection can be closed via the Netcrusher library. > 3) Put a breakpoint on the _updateLatestOrThrow()_ method in the > _FeatureCacheUpdater_ class, before the > _zkClient.getDataAndVersion(featureZkNodePath)_ line executes. > 4) Restore the connection between Kafka and Zookeeper after session expiration. > Kafka execution should stop at the breakpoint. > 5) Resume execution until Kafka starts to execute the line > _zooKeeperClient.handleRequests(remainingRequests)_ in the > _retryRequestsUntilConnected_ method in the _KafkaZkClient_ class. > 6) Again emulate a connectivity problem between Kafka and Zookeeper and wait > until the session expires. > 7) Restore the connection between Kafka and Zookeeper. > 8) Kafka begins the shutdown process, due to: > _ERROR [feature-zk-node-event-process-thread]: Failed to process feature ZK > node change event. The broker will eventually exit. > (kafka.server.FinalizedFeatureChangeListener$ChangeNotificationProcessorThread)_ > > In a real environment this can be caused by network problems and periodic > disconnection and reconnection to Zookeeper within a short time period. > I started a mail thread at > [https://lists.apache.org/thread/gbk4scwd8g7mg2tfsokzj5tjgrjrb9dw] regarding > this problem, but have received no answers. > To me this seems like a defect that should be fixed, because Kafka initiates > shutdown after the connection between Kafka and Zookeeper has been restored. > Thank you. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-14735) Improve KRaft metadata image change performance at high topic counts
[ https://issues.apache.org/jira/browse/KAFKA-14735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino resolved KAFKA-14735. --- Resolution: Fixed > Improve KRaft metadata image change performance at high topic counts > > > Key: KAFKA-14735 > URL: https://issues.apache.org/jira/browse/KAFKA-14735 > Project: Kafka > Issue Type: Improvement > Components: kraft >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Major > Fix For: 3.6.0 > > > Performance of KRaft metadata image changes is currently O(<# of topics in > cluster>). This means the amount of time it takes to create just a *single* > topic scales linearly with the number of topics in the entire cluster. This > impacts both controllers and brokers because both use the metadata image to > represent the KRaft metadata log. The performance of these changes should > scale with the number of topics being changed -- so creating a single topic > should perform similarly regardless of the number of topics in the cluster. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-14887) ZK session timeout can cause broker to shutdown
Ron Dagostino created KAFKA-14887: - Summary: ZK session timeout can cause broker to shutdown Key: KAFKA-14887 URL: https://issues.apache.org/jira/browse/KAFKA-14887 Project: Kafka Issue Type: Improvement Affects Versions: 3.3.2, 3.3.1, 3.2.3, 3.2.2, 3.4.0, 3.2.1, 3.1.2, 3.0.2, 3.3.0, 3.1.1, 3.2.0, 2.8.2, 3.0.1, 3.0.0, 2.8.1, 2.7.2, 3.1.0, 2.7.1, 2.8.0, 2.7.0 Reporter: Ron Dagostino We have the following code in FinalizedFeatureChangeListener.scala, which will exit regardless of the type of exception thrown when trying to process feature changes: case e: Exception => { error("Failed to process feature ZK node change event. The broker will eventually exit.", e); throw new FatalExitError(1) } The issue here is that this does not distinguish between exceptions caused by an inability to process a feature change and exceptions caused by a ZooKeeper session timeout. We want to shut the broker down in the former case, but we do NOT want to shut the broker down in the latter case; the ZooKeeper session will eventually be reestablished, and we can continue processing at that time. -- This message was sent by Atlassian Jira (v8.20.10#820010)
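A hedged Java rendering of the requested distinction (the real listener is Scala; the ZooKeeperClientExpiredException class below is a stand-in for the one in kafka.zookeeper):

{code:java}
import org.apache.kafka.common.internals.FatalExitError;

public class FeatureChangeHandlingSketch {
    // Stand-in for kafka.zookeeper.ZooKeeperClientExpiredException.
    static class ZooKeeperClientExpiredException extends RuntimeException { }

    static void processEvent(Runnable event) {
        try {
            event.run();
        } catch (ZooKeeperClientExpiredException e) {
            // The session will eventually be reestablished; keep the broker up and retry later.
            System.err.println("ZK session expired while processing feature change; will retry");
        } catch (Exception e) {
            // Genuine processing failures should still be fatal.
            System.err.println("Failed to process feature ZK node change event. The broker will eventually exit.");
            throw new FatalExitError(1);
        }
    }
}
{code}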
[jira] [Created] (KAFKA-14711) kafka-metadata-quorum.sh does not honor --command-config
Ron Dagostino created KAFKA-14711: - Summary: kafka-metadata-quorum.sh does not honor --command-config Key: KAFKA-14711 URL: https://issues.apache.org/jira/browse/KAFKA-14711 Project: Kafka Issue Type: Bug Components: kraft Affects Versions: 3.4.0 Reporter: Ron Dagostino Assignee: Ron Dagostino Fix For: 3.4.1 https://github.com/apache/kafka/pull/12951 accidentally eliminated support for the `--command-config` option in the `kafka-metadata-quorum.sh` command. This was an undetected regression in the 3.4.0 release. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-14711) kafka-metadata-quorum.sh does not honor --command-config
[ https://issues.apache.org/jira/browse/KAFKA-14711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino resolved KAFKA-14711. --- Resolution: Fixed > kafka-metadata-quorum.sh does not honor --command-config > - > > Key: KAFKA-14711 > URL: https://issues.apache.org/jira/browse/KAFKA-14711 > Project: Kafka > Issue Type: Bug > Components: kraft >Affects Versions: 3.4.0 >Reporter: Ron Dagostino >Assignee: Ron Dagostino >Priority: Critical > Fix For: 3.4.1 > > > https://github.com/apache/kafka/pull/12951 accidentally eliminated support > for the `--command-config` option in the `kafka-metadata-quorum.sh` command. > This was an undetected regression in the 3.4.0 release. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15039) Reduce logging level to trace in PartitionChangeBuilder.tryElection()
Ron Dagostino created KAFKA-15039: - Summary: Reduce logging level to trace in PartitionChangeBuilder.tryElection() Key: KAFKA-15039 URL: https://issues.apache.org/jira/browse/KAFKA-15039 Project: Kafka Issue Type: Improvement Components: kraft Reporter: Ron Dagostino Assignee: Ron Dagostino Fix For: 3.6.0 A CPU profile in a large cluster showed PartitionChangeBuilder.tryElection() taking significant CPU due to logging. Decrease the logging statements in that method from debug level to trace to mitigate the impact of this CPU hog under normal operations. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (KAFKA-15098) KRaft migration does not proceed and broker dies if authorizer.class.name is set
Ron Dagostino created KAFKA-15098: - Summary: KRaft migration does not proceed and broker dies if authorizer.class.name is set Key: KAFKA-15098 URL: https://issues.apache.org/jira/browse/KAFKA-15098 Project: Kafka Issue Type: Bug Components: kraft Affects Versions: 3.5.0 Reporter: Ron Dagostino Assignee: David Arthur The broker fails with: java.lang.IllegalArgumentException: requirement failed: ZooKeeper migration does not yet support authorizers. Remove authorizer.class.name before performing a migration. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (KAFKA-15471) Allow independently stopping KRaft controllers or brokers
[ https://issues.apache.org/jira/browse/KAFKA-15471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Dagostino resolved KAFKA-15471. --- Resolution: Fixed > Allow independently stopping KRaft controllers or brokers > - > > Key: KAFKA-15471 > URL: https://issues.apache.org/jira/browse/KAFKA-15471 > Project: Kafka > Issue Type: Improvement >Reporter: Hailey Ni >Assignee: Hailey Ni >Priority: Major > > Some users run KRaft controllers and brokers on the same machine (not > containerized, but through tarballs, etc). Prior to KRaft, when running > ZooKeeper and Kafka on the same machine, users could independently stop the > ZooKeeper node and the Kafka broker, since there were specific shell scripts for > each (zookeeper-server-stop and kafka-server-stop, respectively). > However, in KRaft mode they can't stop the KRaft controllers independently > from the Kafka brokers, because there is just a single script that doesn't > distinguish between the two processes and signals both of them. We need to > provide a way for users to kill either controllers or brokers. -- This message was sent by Atlassian Jira (v8.20.10#820010)