[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId
[ https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076895#comment-17076895 ] Raman Gupta commented on KAFKA-8803: We are on 2.4.1 now, locally patched with the fix for KAFKA-9749. So far so good. > Stream will not start due to TimeoutException: Timeout expired after > 60000milliseconds while awaiting InitProducerId > > > Key: KAFKA-8803 > URL: https://issues.apache.org/jira/browse/KAFKA-8803 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Raman Gupta >Assignee: Sophie Blee-Goldman >Priority: Major > Attachments: logs-20200311.txt.gz, logs-client-20200311.txt.gz, > logs.txt.gz, screenshot-1.png > > > One streams app is consistently failing at startup with the following > exception: > {code} > 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] > org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout > exception caught when initializing transactions for task 0_36. This might > happen if the broker is slow to respond, if the network connection to the > broker was interrupted, or if similar circumstances arise. You can increase > producer parameter `max.block.ms` to increase this timeout. > org.apache.kafka.common.errors.TimeoutException: Timeout expired after > 60000milliseconds while awaiting InitProducerId > {code} > These same brokers are used by many other streams without any issue, > including some in the very same processes as the stream which consistently > throws this exception. > *UPDATE 08/16:* > The very first instance of this error is August 13th 2019, 17:03:36.754 and > it happened for 4 different streams. For 3 of these streams, the error only > happened once, and then the stream recovered. For the 4th stream, the error > has continued to happen, and continues to happen now. > I looked up the broker logs for this time, and see that at August 13th 2019, > 16:47:43, two of four brokers started reporting messages like this, for > multiple partitions: > [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, > fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader > reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread) > The UNKNOWN_LEADER_EPOCH messages continued for some time, and then stopped, > here is a view of the count of these messages over time: > !screenshot-1.png! > However, as noted, the stream task timeout error continues to happen. > I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 > broker. The broker has a patch for KAFKA-8773. -- This message was sent by Atlassian Jira (v8.3.4#803005)
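As the error text above suggests, the producer-level `max.block.ms` can be raised through the Streams configuration. A minimal sketch, assuming a hypothetical application id and broker address; the 120000 ms value is only an illustration, not a recommendation from this ticket:
{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class MaxBlockMsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-app");    // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // hypothetical broker
        // Forward max.block.ms to the internally created producers; the default is 60000 ms,
        // which is the timeout seen in the InitProducerId error above.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.MAX_BLOCK_MS_CONFIG), 120000);
        // ... pass props to new KafkaStreams(topology, props) as usual
    }
}
{code}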
[jira] [Created] (KAFKA-9827) MM2 doesn't replicate data
Dmitry created KAFKA-9827: - Summary: MM2 doesn't replicate data Key: KAFKA-9827 URL: https://issues.apache.org/jira/browse/KAFKA-9827 Project: Kafka Issue Type: Bug Components: mirrormaker Affects Versions: 2.4.1, 2.4.0 Reporter: Dmitry Attachments: mm2.log I have 2 clusters with different node counts: # 2.4.0 version, 3 nodes, broker port plaintext:9090 # 2.4.0 version, 1 node, broker port plaintext: I want to transfer data from cluster 1 -> cluster 2 using MirrorMaker 2. MM2 configuration: {code:java} name = mm2-backupFlow topics = .* groups = .* # specify any number of cluster aliases clusters = m1-source, m1-backup # connection information for each cluster m1-source.bootstrap.servers = 172.17.165.49:9090, 172.17.165.50:9090, 172.17.165.51:9090 m1-backup.bootstrap.servers = 172.17.165.52: # enable and configure individual replication flows m1-source->m1-backup.enabled = true m1-soruce->m1-backup.emit.heartbeats.enabled = false {code} When I start MM2, 3 topics are created on each cluster: * heartbeat * *.internal And that's all. Nothing else happens. I see the following ERRORs: {code:java} [2020-04-07 07:19:16,821] ERROR Plugin class loader for connector: 'org.apache.kafka.connect.mirror.MirrorHeartbeatConnector' was not found. Returning: org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader@1e5f4170 (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:165) {code} {code:java} [2020-04-07 07:19:18,312] ERROR Plugin class loader for connector: 'org.apache.kafka.connect.mirror.MirrorCheckpointConnector' was not found. Returning: org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader@1e5f4170 (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:165) {code} {code:java} [2020-04-07 07:19:16,807] ERROR Plugin class loader for connector: 'org.apache.kafka.connect.mirror.MirrorSourceConnector' was not found. Returning: org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader@1e5f4170 (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:165){code} Full log is attached. What did I miss? Help please. -- This message was sent by Atlassian Jira (v8.3.4#803005)
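For reference, a minimal sketch of a single one-way MM2 flow in the same properties format as above; the addresses and the backup port are placeholders, not the reporter's actual values, and the flow-level keys assume the cluster aliases are spelled exactly as declared in `clusters`:
{code}
# Hedged sketch, not the reporter's exact file; addresses and ports are placeholders.
clusters = m1-source, m1-backup
m1-source.bootstrap.servers = 172.17.165.49:9090,172.17.165.50:9090,172.17.165.51:9090
m1-backup.bootstrap.servers = 172.17.165.52:9090
# Flow-level keys reuse the aliases declared in "clusters" verbatim.
m1-source->m1-backup.enabled = true
m1-source->m1-backup.topics = .*
m1-source->m1-backup.emit.heartbeats.enabled = false
{code}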
[jira] [Commented] (KAFKA-9818) Flaky Test RecordCollectorTest.shouldNotThrowStreamsExceptionOnSubsequentCallIfASendFailsWithContinueExceptionHandler
[ https://issues.apache.org/jira/browse/KAFKA-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076890#comment-17076890 ] Matthias J. Sax commented on KAFKA-9818: This might be related to https://issues.apache.org/jira/browse/KAFKA-9819 and we might want to do the same fix for both. Thank to [~ableegoldman] for pointing it out. > Flaky Test > RecordCollectorTest.shouldNotThrowStreamsExceptionOnSubsequentCallIfASendFailsWithContinueExceptionHandler > - > > Key: KAFKA-9818 > URL: https://issues.apache.org/jira/browse/KAFKA-9818 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Sophie Blee-Goldman >Assignee: Matthias J. Sax >Priority: Major > Labels: flaky-test > > h3. Error Message > java.lang.AssertionError > h3. Stacktrace > java.lang.AssertionError at org.junit.Assert.fail(Assert.java:87) at > org.junit.Assert.assertTrue(Assert.java:42) at > org.junit.Assert.assertTrue(Assert.java:53) at > org.apache.kafka.streams.processor.internals.RecordCollectorTest.shouldNotThrowStreamsExceptionOnSubsequentCallIfASendFailsWithContinueExceptionHandler(RecordCollectorTest.java:521) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at > org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) > at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at > org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at > org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at > org.junit.runners.ParentRunner.run(ParentRunner.java:413) at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:110) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38) > at > org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:62) > at > org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51) > at 
jdk.internal.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) at > org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36) > at > org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24) > at > org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:33) > at > org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:94) > at com.sun.proxy.$Proxy5.processTestClass(Unknown Source) at > org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:118) > at jdk.internal.reflect.GeneratedMethodAccessor19.invoke(Unknown Source) at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) at >
[jira] [Commented] (KAFKA-9819) Flaky Test StoreChangelogReaderTest#shouldNotThrowOnUnknownRevokedPartition[0]
[ https://issues.apache.org/jira/browse/KAFKA-9819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076889#comment-17076889 ] Matthias J. Sax commented on KAFKA-9819: This might be related to https://issues.apache.org/jira/browse/KAFKA-9818 and we might want to do the same fix for both. Thank to [~ableegoldman] for pointing it out. > Flaky Test StoreChangelogReaderTest#shouldNotThrowOnUnknownRevokedPartition[0] > -- > > Key: KAFKA-9819 > URL: https://issues.apache.org/jira/browse/KAFKA-9819 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Sophie Blee-Goldman >Priority: Major > Labels: flaky-test > > h3. Error Message > java.lang.AssertionError: expected:<[test-reader Changelog partition > unknown-0 could not be found, it could be already cleaned up during the > handlingof task corruption and never restore again]> but was:<[[AdminClient > clientId=adminclient-91] Connection to node -1 (localhost/127.0.0.1:8080) > could not be established. Broker may not be available., test-reader Changelog > partition unknown-0 could not be found, it could be already cleaned up during > the handlingof task corruption and never restore again]> > h3. Stacktrace > java.lang.AssertionError: expected:<[test-reader Changelog partition > unknown-0 could not be found, it could be already cleaned up during the > handlingof task corruption and never restore again]> but was:<[[AdminClient > clientId=adminclient-91] Connection to node -1 (localhost/127.0.0.1:8080) > could not be established. Broker may not be available., test-reader Changelog > partition unknown-0 could not be found, it could be already cleaned up during > the handlingof task corruption and never restore again]> -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-9819) Flaky Test StoreChangelogReaderTest#shouldNotThrowOnUnknownRevokedPartition[0]
[ https://issues.apache.org/jira/browse/KAFKA-9819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias J. Sax reassigned KAFKA-9819: -- Assignee: Matthias J. Sax > Flaky Test StoreChangelogReaderTest#shouldNotThrowOnUnknownRevokedPartition[0] > -- > > Key: KAFKA-9819 > URL: https://issues.apache.org/jira/browse/KAFKA-9819 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Sophie Blee-Goldman >Assignee: Matthias J. Sax >Priority: Major > Labels: flaky-test > > h3. Error Message > java.lang.AssertionError: expected:<[test-reader Changelog partition > unknown-0 could not be found, it could be already cleaned up during the > handlingof task corruption and never restore again]> but was:<[[AdminClient > clientId=adminclient-91] Connection to node -1 (localhost/127.0.0.1:8080) > could not be established. Broker may not be available., test-reader Changelog > partition unknown-0 could not be found, it could be already cleaned up during > the handlingof task corruption and never restore again]> > h3. Stacktrace > java.lang.AssertionError: expected:<[test-reader Changelog partition > unknown-0 could not be found, it could be already cleaned up during the > handlingof task corruption and never restore again]> but was:<[[AdminClient > clientId=adminclient-91] Connection to node -1 (localhost/127.0.0.1:8080) > could not be established. Broker may not be available., test-reader Changelog > partition unknown-0 could not be found, it could be already cleaned up during > the handlingof task corruption and never restore again]> -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9818) Flaky Test RecordCollectorTest.shouldNotThrowStreamsExceptionOnSubsequentCallIfASendFailsWithContinueExceptionHandler
[ https://issues.apache.org/jira/browse/KAFKA-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076888#comment-17076888 ] ASF GitHub Bot commented on KAFKA-9818: --- mjsax commented on pull request #8423: KAFKA-9818: improve error message to debug test URL: https://github.com/apache/kafka/pull/8423 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Flaky Test > RecordCollectorTest.shouldNotThrowStreamsExceptionOnSubsequentCallIfASendFailsWithContinueExceptionHandler > - > > Key: KAFKA-9818 > URL: https://issues.apache.org/jira/browse/KAFKA-9818 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Sophie Blee-Goldman >Assignee: Matthias J. Sax >Priority: Major > Labels: flaky-test > > h3. Error Message > java.lang.AssertionError > h3. Stacktrace > java.lang.AssertionError at org.junit.Assert.fail(Assert.java:87) at > org.junit.Assert.assertTrue(Assert.java:42) at > org.junit.Assert.assertTrue(Assert.java:53) at > org.apache.kafka.streams.processor.internals.RecordCollectorTest.shouldNotThrowStreamsExceptionOnSubsequentCallIfASendFailsWithContinueExceptionHandler(RecordCollectorTest.java:521) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at > org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) > at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at > org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at > org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at > org.junit.runners.ParentRunner.run(ParentRunner.java:413) at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:110) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58) > at > org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38) > at > 
org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:62) > at > org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51) > at jdk.internal.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:566) at > org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36) > at > org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24) > at > org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:33) > at > org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:94) > at com.sun.proxy.$Proxy5.processTestClass(Unknown Source) at > org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:118) > at
[jira] [Commented] (KAFKA-8264) Flaky Test PlaintextConsumerTest#testLowMaxFetchSizeForRequestAndPartition
[ https://issues.apache.org/jira/browse/KAFKA-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076887#comment-17076887 ] Matthias J. Sax commented on KAFKA-8264: [https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/1645/testReport/junit/kafka.api/PlaintextConsumerTest/testLowMaxFetchSizeForRequestAndPartition/] > Flaky Test PlaintextConsumerTest#testLowMaxFetchSizeForRequestAndPartition > -- > > Key: KAFKA-8264 > URL: https://issues.apache.org/jira/browse/KAFKA-8264 > Project: Kafka > Issue Type: Bug > Components: core, unit tests >Affects Versions: 2.0.1, 2.3.0 >Reporter: Matthias J. Sax >Assignee: Stanislav Kozlovski >Priority: Critical > Labels: flaky-test > Fix For: 2.6.0 > > > [https://builds.apache.org/blue/organizations/jenkins/kafka-2.0-jdk8/detail/kafka-2.0-jdk8/252/tests] > {quote}org.apache.kafka.common.errors.TopicExistsException: Topic 'topic3' > already exists.{quote} > STDOUT > > {quote}[2019-04-19 03:54:20,080] ERROR [ReplicaFetcher replicaId=1, > leaderId=0, fetcherId=0] Error for partition __consumer_offsets-0 at offset 0 > (kafka.server.ReplicaFetcherThread:76) > org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server > does not host this topic-partition. > [2019-04-19 03:54:20,080] ERROR [ReplicaFetcher replicaId=2, leaderId=0, > fetcherId=0] Error for partition __consumer_offsets-0 at offset 0 > (kafka.server.ReplicaFetcherThread:76) > org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server > does not host this topic-partition. > [2019-04-19 03:54:20,312] ERROR [ReplicaFetcher replicaId=2, leaderId=0, > fetcherId=0] Error for partition topic-0 at offset 0 > (kafka.server.ReplicaFetcherThread:76) > org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server > does not host this topic-partition. > [2019-04-19 03:54:20,313] ERROR [ReplicaFetcher replicaId=2, leaderId=1, > fetcherId=0] Error for partition topic-1 at offset 0 > (kafka.server.ReplicaFetcherThread:76) > org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server > does not host this topic-partition. > [2019-04-19 03:54:20,994] ERROR [ReplicaFetcher replicaId=1, leaderId=0, > fetcherId=0] Error for partition topic-0 at offset 0 > (kafka.server.ReplicaFetcherThread:76) > org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server > does not host this topic-partition. > [2019-04-19 03:54:21,727] ERROR [ReplicaFetcher replicaId=0, leaderId=2, > fetcherId=0] Error for partition topicWithNewMessageFormat-1 at offset 0 > (kafka.server.ReplicaFetcherThread:76) > org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server > does not host this topic-partition. > [2019-04-19 03:54:28,696] ERROR [ReplicaFetcher replicaId=0, leaderId=2, > fetcherId=0] Error for partition __consumer_offsets-0 at offset 0 > (kafka.server.ReplicaFetcherThread:76) > org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server > does not host this topic-partition. > [2019-04-19 03:54:28,699] ERROR [ReplicaFetcher replicaId=1, leaderId=2, > fetcherId=0] Error for partition __consumer_offsets-0 at offset 0 > (kafka.server.ReplicaFetcherThread:76) > org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server > does not host this topic-partition. 
> [2019-04-19 03:54:29,246] ERROR [ReplicaFetcher replicaId=0, leaderId=2, > fetcherId=0] Error for partition topic-1 at offset 0 > (kafka.server.ReplicaFetcherThread:76) > org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server > does not host this topic-partition. > [2019-04-19 03:54:29,247] ERROR [ReplicaFetcher replicaId=0, leaderId=1, > fetcherId=0] Error for partition topic-0 at offset 0 > (kafka.server.ReplicaFetcherThread:76) > org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server > does not host this topic-partition. > [2019-04-19 03:54:29,287] ERROR [ReplicaFetcher replicaId=1, leaderId=2, > fetcherId=0] Error for partition topic-1 at offset 0 > (kafka.server.ReplicaFetcherThread:76) > org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server > does not host this topic-partition. > [2019-04-19 03:54:33,408] ERROR [ReplicaFetcher replicaId=0, leaderId=2, > fetcherId=0] Error for partition __consumer_offsets-0 at offset 0 > (kafka.server.ReplicaFetcherThread:76) > org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server > does not host this topic-partition. > [2019-04-19 03:54:33,408] ERROR [ReplicaFetcher replicaId=1, leaderId=2, > fetcherId=0] Error for partition __consumer_offsets-0 at offset 0 > (kafka.server.ReplicaFetcherThread:76) >
[jira] [Commented] (KAFKA-7965) Flaky Test ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup
[ https://issues.apache.org/jira/browse/KAFKA-7965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076886#comment-17076886 ] Matthias J. Sax commented on KAFKA-7965: [https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/1645/testReport/junit/kafka.api/ConsumerBounceTest/testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup/] and [https://builds.apache.org/job/kafka-pr-jdk11-scala2.13/5664/testReport/junit/kafka.api/ConsumerBounceTest/testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup/] > Flaky Test > ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup > > > Key: KAFKA-7965 > URL: https://issues.apache.org/jira/browse/KAFKA-7965 > Project: Kafka > Issue Type: Bug > Components: clients, consumer, unit tests >Affects Versions: 1.1.1, 2.2.0, 2.3.0 >Reporter: Matthias J. Sax >Priority: Critical > Labels: flaky-test > Fix For: 2.3.0 > > > To get stable nightly builds for `2.2` release, I create tickets for all > observed test failures. > [https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/21/] > {quote}java.lang.AssertionError: Received 0, expected at least 68 at > org.junit.Assert.fail(Assert.java:88) at > org.junit.Assert.assertTrue(Assert.java:41) at > kafka.api.ConsumerBounceTest.receiveAndCommit(ConsumerBounceTest.scala:557) > at > kafka.api.ConsumerBounceTest.$anonfun$testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup$1(ConsumerBounceTest.scala:320) > at > kafka.api.ConsumerBounceTest.$anonfun$testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup$1$adapted(ConsumerBounceTest.scala:319) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at > kafka.api.ConsumerBounceTest.testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup(ConsumerBounceTest.scala:319){quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-6145) Warm up new KS instances before migrating tasks - potentially a two phase rebalance
[ https://issues.apache.org/jira/browse/KAFKA-6145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076870#comment-17076870 ] ASF GitHub Bot commented on KAFKA-6145: --- ableegoldman commented on pull request #8436: KAFKA-6145: KIP-441 avoid unnecessary movement of standbys URL: https://github.com/apache/kafka/pull/8436 Currently we add warmup and standby tasks, meaning we first assign up to max.warmup.replica warmup tasks, and then attempt to assign num.standby copies of each stateful task. This can cause unnecessary transient standbys to pop up for the lifetime of the warmup task, which are presumably not what the user wanted. Note that we don’t want to simply count all warmups against the configured num.standbys, as this may cause the opposite problem where a standby we intend to keep is temporarily unassigned (which may lead to the cleanup thread deleting it). We should only count this as a standby if the destination client already had this task as a standby; otherwise, the standby already exists on some other client, so we should aim to give it back. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Warm up new KS instances before migrating tasks - potentially a two phase > rebalance > --- > > Key: KAFKA-6145 > URL: https://issues.apache.org/jira/browse/KAFKA-6145 > Project: Kafka > Issue Type: New Feature > Components: streams >Reporter: Antony Stubbs >Assignee: Sophie Blee-Goldman >Priority: Major > Labels: needs-kip > > Currently when expanding the KS cluster, the new node's partitions will be > unavailable during the rebalance, which for large states can take a very long > time, or for small state stores even more than a few ms can be a deal breaker > for micro service use cases. > One workaround would be to execute the rebalance in two phases: > 1) start running state store building on the new node > 2) once the state store is fully populated on the new node, only then > rebalance the tasks - there will still be a rebalance pause, but would be > greatly reduced > Relates to: KAFKA-6144 - Allow state stores to serve stale reads during > rebalance -- This message was sent by Atlassian Jira (v8.3.4#803005)
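For context, the warmup/standby distinction above maps onto two Streams configs. A rough sketch of how they might be set, assuming the KIP-441 warmup config is exposed under the name `max.warmup.replicas` (application id and broker address are placeholders):
{code:java}
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class Kip441ConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-app");    // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // hypothetical broker
        // Permanent standbys kept per stateful task (existing config).
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
        // Transient warmup copies used while state is migrated (KIP-441); config name assumed here.
        props.put("max.warmup.replicas", 2);
    }
}
{code}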
[jira] [Updated] (KAFKA-9826) Log cleaning repeatedly picks same segment with no effect when first dirty offset is past start of active segment
[ https://issues.apache.org/jira/browse/KAFKA-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Rodrigues updated KAFKA-9826: --- Summary: Log cleaning repeatedly picks same segment with no effect when first dirty offset is past start of active segment (was: Log cleaning goes in infinite loop when first dirty offset is past start of active segment) > Log cleaning repeatedly picks same segment with no effect when first dirty > offset is past start of active segment > - > > Key: KAFKA-9826 > URL: https://issues.apache.org/jira/browse/KAFKA-9826 > Project: Kafka > Issue Type: Bug > Components: log cleaner >Affects Versions: 2.4.1 >Reporter: Steve Rodrigues >Assignee: Steve Rodrigues >Priority: Major > > Seen on a system where a given partition had a single segment, and for > whatever reason (deleteRecords?), the logStartOffset was greater than the > base segment of the log, and there was a continuous series of > ``` > [2020-03-03 16:56:31,374] WARN Resetting first dirty offset of FOO-3 to log > start offset 55649 since the checkpointed offset 0 is invalid. > (kafka.log.LogCleanerManager$) > ``` > messages (partition name changed, it wasn't really FOO). This was expected to > be resolved by KAFKA-6266 but clearly wasn't. > Further investigation revealed that a few segments were continuously > cleaning and generating messages in the `log-cleaner.log` of the form: > ``` > [2020-03-31 13:34:50,924] INFO Cleaner 1: Beginning cleaning of log FOO-3 > (kafka.log.LogCleaner) > [2020-03-31 13:34:50,924] INFO Cleaner 1: Building offset map for FOO-3... > (kafka.log.LogCleaner) > [2020-03-31 13:34:50,927] INFO Cleaner 1: Building offset map for log FOO-3 > for 0 segments in offset range [55287, 54237). (kafka.log.LogCleaner) > [2020-03-31 13:34:50,927] INFO Cleaner 1: Offset map for log FOO-3 complete. > (kafka.log.LogCleaner) > [2020-03-31 13:34:50,927] INFO Cleaner 1: Cleaning log FOO-3 (cleaning prior > to Wed Dec 31 19:00:00 EST 1969, discarding tombstones prior to Tue Dec 10 > 13:39:08 EST 2019)... (kafka.log.LogCleaner) > [2020-03-31 13:34:50,927] INFO [kafka-log-cleaner-thread-1]: Log cleaner > thread 1 cleaned log FOO-3 (dirty section = [55287, 55287]) > 0.0 MB of log processed in 0.0 seconds (0.0 MB/sec). > Indexed 0.0 MB in 0.0 seconds (0.0 Mb/sec, 100.0% of total time) > Buffer utilization: 0.0% > Cleaned 0.0 MB in 0.0 seconds (NaN Mb/sec, 0.0% of total time) > Start size: 0.0 MB (0 messages) > End size: 0.0 MB (0 messages) NaN% size reduction (NaN% fewer messages) > (kafka.log.LogCleaner) > ``` > What seems to have happened here (data determined for a different partition) > is: > There exist a number of partitions here which get relatively low traffic, > including our friend FOO-5. For whatever reason, LogStartOffset of this > partition has moved beyond the baseOffset of the active segment. (Notes in > other issues indicate that this is a reasonable scenario.) So there’s one > segment, starting at 166266, and a log start of 166301. > grabFilthiestCompactedLog runs and reads the checkpoint file. We see that > this topicpartition needs to be cleaned, and call cleanableOffsets on it > which returns an OffsetsToClean with firstDirtyOffset == logStartOffset == > 166301 and firstUncleanableOffset = max(logStart, activeSegment.baseOffset) = > 166301, and forceCheckpoint = true. > The checkpoint file is updated in grabFilthiestCompactedLog (this is the fix > for KAFKA-6266).
We then create a LogToClean object based on the > firstDirtyOffset and firstUncleanableOffset of 166301 (past the active > segment’s base offset). > The LogToClean object has cleanBytes = logSegments(-1, > firstDirtyOffset).map(_.size).sum → the size of this segment. It has > firstUncleanableOffset and cleanableBytes determined by > calculateCleanableBytes. calculateCleanableBytes returns: > {{}} > {{val firstUncleanableSegment = > log.nonActiveLogSegmentsFrom(uncleanableOffset).headOption.getOrElse(log.activeSegment)}} > {{val firstUncleanableOffset = firstUncleanableSegment.baseOffset}} > {{val cleanableBytes = log.logSegments(firstDirtyOffset, > math.max(firstDirtyOffset, firstUncleanableOffset)).map(_.size.toLong).sum > (firstUncleanableOffset, cleanableBytes)}} > firstUncleanableSegment is activeSegment. firstUncleanableOffset is the base > offset, 166266. cleanableBytes is looking for logSegments(166301, max(166301, > 166266) → which _is the active segment_ > So there are “cleanableBytes” > 0. > We then filter out segments with totalbytes (clean + cleanable) > 0. This > segment has totalBytes > 0, and it has cleanablebytes, so great! It’s > filthiest. > The cleaner picks it, calls cleanLog on it, which then does
[jira] [Resolved] (KAFKA-9815) Consumer may never re-join if inconsistent metadata is received once
[ https://issues.apache.org/jira/browse/KAFKA-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Gustafson resolved KAFKA-9815. Fix Version/s: 2.4.2 2.5.0 Resolution: Fixed > Consumer may never re-join if inconsistent metadata is received once > > > Key: KAFKA-9815 > URL: https://issues.apache.org/jira/browse/KAFKA-9815 > Project: Kafka > Issue Type: Bug > Components: consumer >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > Fix For: 2.5.0, 2.4.2 > > > KAFKA-9797 is the result of an incorrect rolling upgrade test where a new > listener is added to brokers and set as the inter-broker listener within the > same rolling upgrade. As a result, metadata is inconsistent across brokers > until the rolling upgrade completes, since interbroker communication is broken > until all brokers have the new listener. The test fails due to consumer > timeouts, and sometimes this is because the upgrade takes longer than the consumer > timeout. But several logs show an issue with the consumer when one metadata > response received during upgrade is different from the consumer's cached > `assignmentSnapshot`, triggering rebalance. > In > [https://github.com/apache/kafka/blob/7f640f13b4d486477035c3edb28466734f053beb/clients/src/main/java/org/apache/kafka/clients/consumer/internals/ConsumerCoordinator.java#L750,] > we return true for `rejoinNeededOrPending()` if `assignmentSnapshot` is not > the same as the current `metadataSnapshot`. We don't set `rejoinNeeded` in > the instance, but we revoke partitions and send a JoinGroup request. If the > JoinGroup request fails and a subsequent metadata response contains the same > snapshot value as the previously cached `assignmentSnapshot`, we never send > `JoinGroup` again since snapshots match and `rejoinNeeded=false`. Partitions > are not assigned to the consumer after this and the test fails because > messages are not received. > Even though this particular system test isn't a valid upgrade scenario, we > should fix the consumer, since temporary metadata differences between brokers > can result in this scenario. -- This message was sent by Atlassian Jira (v8.3.4#803005)
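A toy, self-contained model of the failure sequence described above (not the actual ConsumerCoordinator code): a one-off metadata mismatch triggers a join attempt, the attempt fails, the snapshots match again, and because no flag was latched the consumer never re-joins.
{code:java}
public class RejoinSketch {
    static int assignmentSnapshot = 1;   // snapshot captured at the last successful assignment
    static int metadataSnapshot = 1;     // latest metadata snapshot
    static boolean rejoinNeeded = false; // latch that the fix ensures is set before joining

    static boolean rejoinNeededOrPending() {
        return rejoinNeeded || assignmentSnapshot != metadataSnapshot;
    }

    public static void main(String[] args) {
        metadataSnapshot = 2;                        // inconsistent metadata arrives once
        System.out.println(rejoinNeededOrPending()); // true -> partitions revoked, JoinGroup sent
        // The JoinGroup request fails here; with the fix, rejoinNeeded would be latched to true.
        metadataSnapshot = 1;                        // later metadata matches the cached snapshot again
        System.out.println(rejoinNeededOrPending()); // false -> JoinGroup is never retried (the bug)
    }
}
{code}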
[jira] [Commented] (KAFKA-9815) Consumer may never re-join if inconsistent metadata is received once
[ https://issues.apache.org/jira/browse/KAFKA-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076764#comment-17076764 ] ASF GitHub Bot commented on KAFKA-9815: --- hachikuji commented on pull request #8420: KAFKA-9815; Ensure consumer always re-joins if JoinGroup fails URL: https://github.com/apache/kafka/pull/8420 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Consumer may never re-join if inconsistent metadata is received once > > > Key: KAFKA-9815 > URL: https://issues.apache.org/jira/browse/KAFKA-9815 > Project: Kafka > Issue Type: Bug > Components: consumer >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Major > > KAFKA-9797 is the result of an incorrect rolling upgrade test where a new > listener is added to brokers and set as the inter-broker listener within the > same rolling upgrade. As a result, metadata is inconsistent across brokers > until the rolling upgrade completes since interbroker communication is broken > until all brokers have the new listener. The test fails due to consumer > timeouts and sometimes this is because the upgrade takes longer than consumer > timeout. But several logs show an issue with the consumer when one metadata > response received during upgrade is different from the consumer's cached > `assignmentSnapshot`, triggering rebalance. > In > [https://github.com/apache/kafka/blob/7f640f13b4d486477035c3edb28466734f053beb/clients/src/main/java/org/apache/kafka/clients/consumer/internals/ConsumerCoordinator.java#L750,] > we return true for `rejoinNeededOrPending()` if `assignmentSnapshot` is not > the same as the current `metadataSnapshot`. We don't set `rejoinNeeded` in > the instance, but we revoke partitions and send JoinGroup request. If the > JoinGroup request fails and a subsequent metadata response contains the same > snapshot value as the previously cached `assignmentSnapshot`, we never send > `JoinGroup` again since snapshots match and `rejoinNeeded=false`. Partitions > are not assigned to the consumer after this and the test fails because > messages are not received. > Even though this particular system test isn't a valid upgrade scenario, we > should fix the consumer, since temporary metadata differences between brokers > can result in this scenario. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9697) ControlPlaneNetworkProcessorAvgIdlePercent is always NaN
[ https://issues.apache.org/jira/browse/KAFKA-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076763#comment-17076763 ] James Cheng commented on KAFKA-9697: Oh, no, we don't have that setup. I had assumed that the ControlPlaneNetworkProcessorAvgIdlePercent metric would be present, even if both the control plane and the data plane shared the same listeners. But apparently that is not the case? Is this purely a documentation issue then? Should we add text saying that the ControlPlaneNetworkProcessorAvgIdlePercent metric will be NaN if control.plane.listener.name is unset? (Related, I noticed that ControlPlaneNetworkProcessorAvgIdlePercent isn't documented on the website) Or, should we programmatically hide the metric if control.plane.listener.name is not yet set? Or, can we still calculate a value of ControlPlaneNetworkProcessorAvgIdlePercent when the control plane and data plane share a listener? > ControlPlaneNetworkProcessorAvgIdlePercent is always NaN > > > Key: KAFKA-9697 > URL: https://issues.apache.org/jira/browse/KAFKA-9697 > Project: Kafka > Issue Type: Improvement > Components: metrics >Affects Versions: 2.3.0 >Reporter: James Cheng >Priority: Major > > I have a broker running Kafka 2.3.0. The value of > kafka.network:type=SocketServer,name=ControlPlaneNetworkProcessorAvgIdlePercent > is always "NaN". > Is that normal, or is there a problem with the metric? > I am running Kafka 2.3.0. I have not checked this in newer/older versions. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9826) Log cleaning goes in infinite loop when first dirty offset is past start of active segment
Steve Rodrigues created KAFKA-9826: -- Summary: Log cleaning goes in infinite loop when first dirty offset is past start of active segment Key: KAFKA-9826 URL: https://issues.apache.org/jira/browse/KAFKA-9826 Project: Kafka Issue Type: Bug Components: log cleaner Affects Versions: 2.4.1 Reporter: Steve Rodrigues Assignee: Steve Rodrigues Seen on a system where a given partition had a single segment, and for whatever reason (deleteRecords?), the logStartOffset was greater than the base segment of the log, and there was a continuous series of ``` [2020-03-03 16:56:31,374] WARN Resetting first dirty offset of FOO-3 to log start offset 55649 since the checkpointed offset 0 is invalid. (kafka.log.LogCleanerManager$) ``` messages (partition name changed, it wasn't really FOO). This was expected to be resolved by KAFKA-6266 but clearly wasn't. Further investigation revealed that a few segments were continuously cleaning and generating messages in the `log-cleaner.log` of the form: ``` [2020-03-31 13:34:50,924] INFO Cleaner 1: Beginning cleaning of log FOO-3 (kafka.log.LogCleaner) [2020-03-31 13:34:50,924] INFO Cleaner 1: Building offset map for FOO-3... (kafka.log.LogCleaner) [2020-03-31 13:34:50,927] INFO Cleaner 1: Building offset map for log FOO-3 for 0 segments in offset range [55287, 54237). (kafka.log.LogCleaner) [2020-03-31 13:34:50,927] INFO Cleaner 1: Offset map for log FOO-3 complete. (kafka.log.LogCleaner) [2020-03-31 13:34:50,927] INFO Cleaner 1: Cleaning log FOO-3 (cleaning prior to Wed Dec 31 19:00:00 EST 1969, discarding tombstones prior to Tue Dec 10 13:39:08 EST 2019)... (kafka.log.LogCleaner) [2020-03-31 13:34:50,927] INFO [kafka-log-cleaner-thread-1]: Log cleaner thread 1 cleaned log FOO-3 (dirty section = [55287, 55287]) 0.0 MB of log processed in 0.0 seconds (0.0 MB/sec). Indexed 0.0 MB in 0.0 seconds (0.0 Mb/sec, 100.0% of total time) Buffer utilization: 0.0% Cleaned 0.0 MB in 0.0 seconds (NaN Mb/sec, 0.0% of total time) Start size: 0.0 MB (0 messages) End size: 0.0 MB (0 messages) NaN% size reduction (NaN% fewer messages) (kafka.log.LogCleaner) ``` What seems to have happened here (data determined for a different partition) is: There exist a number of partitions here which get relatively low traffic, including our friend FOO-5. For whatever reason, LogStartOffset of this partition has moved beyond the baseOffset of the active segment. (Notes in other issues indicate that this is a reasonable scenario.) So there’s one segment, starting at 166266, and a log start of 166301. grabFilthiestCompactedLog runs and reads the checkpoint file. We see that this topicpartition needs to be cleaned, and call cleanableOffsets on it which returns an OffsetsToClean with firstDirtyOffset == logStartOffset == 166301 and firstUncleanableOffset = max(logStart, activeSegment.baseOffset) = 166301, and forceCheckpoint = true. The checkpoint file is updated in grabFilthiestCompactedLog (this is the fix for KAFKA-6266). We then create a LogToClean object based on the firstDirtyOffset and firstUncleanableOffset of 166301 (past the active segment’s base offset). The LogToClean object has cleanBytes = logSegments(-1, firstDirtyOffset).map(_.size).sum → the size of this segment. It has firstUncleanableOffset and cleanableBytes determined by calculateCleanableBytes. 
calculateCleanableBytes returns: {{}} {{val firstUncleanableSegment = log.nonActiveLogSegmentsFrom(uncleanableOffset).headOption.getOrElse(log.activeSegment)}} {{val firstUncleanableOffset = firstUncleanableSegment.baseOffset}} {{val cleanableBytes = log.logSegments(firstDirtyOffset, math.max(firstDirtyOffset, firstUncleanableOffset)).map(_.size.toLong).sum (firstUncleanableOffset, cleanableBytes)}} firstUncleanableSegment is activeSegment. firstUncleanableOffset is the base offset, 166266. cleanableBytes is looking for logSegments(166301, max(166301, 166266) → which _is the active segment_ So there are “cleanableBytes” > 0. We then filter out segments with totalbytes (clean + cleanable) > 0. This segment has totalBytes > 0, and it has cleanablebytes, so great! It’s filthiest. The cleaner picks it, calls cleanLog on it, which then does cleaner.clean, which returns nextDirtyOffset and cleaner stats. cleaner.clean callls doClean, which builds an offsetMap. The offsetMap looks at non-active segments, when building, but there aren’t any. So the endOffset of the offsetMap is lastOffset (default -1) + 1 → 0. We record the stats (including logging to log-cleaner.log). After this we call cleanerManager.doneCleaning, which writes the checkpoint file with the latest value… of 0. And then the process starts all over. It appears that there's at least one bug here, where `log.logSegments(from, to)` will return an empty list if from == to and both are
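To make the arithmetic above concrete, a toy, self-contained walk-through of the offsets described in this report (the real logic lives in the Scala LogCleanerManager/LogCleaner; the segment size here is a made-up value):
{code:java}
public class CleanableBytesSketch {
    public static void main(String[] args) {
        long activeSegmentBaseOffset = 166266L; // base offset of the only (active) segment
        long logStartOffset = 166301L;          // log start has moved past the active segment's base
        long segmentSizeBytes = 1024L;          // hypothetical on-disk size of that segment

        long firstDirtyOffset = logStartOffset; // invalid checkpoint reset to log start
        long upperBound = Math.max(firstDirtyOffset, activeSegmentBaseOffset); // 166301

        // logSegments(firstDirtyOffset, upperBound) == logSegments(166301, 166301) unexpectedly
        // yields the active segment, so its whole size counts as cleanable and the log looks filthiest.
        long cleanableBytes = segmentSizeBytes;
        System.out.println("cleanableBytes = " + cleanableBytes);

        // The offset map is built only from non-active segments (none here), so its end offset is
        // lastOffset (-1) + 1 = 0, and that 0 is written back to the checkpoint, restarting the loop.
        long nextCheckpointedOffset = -1L + 1L;
        System.out.println("next checkpointed offset = " + nextCheckpointedOffset);
    }
}
{code}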
[jira] [Commented] (KAFKA-9753) Add task-level active-process-ratio to Streams metrics
[ https://issues.apache.org/jira/browse/KAFKA-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076721#comment-17076721 ] ASF GitHub Bot commented on KAFKA-9753: --- guozhangwang commented on pull request #8371: KAFKA-9753: A few more metrics to add URL: https://github.com/apache/kafka/pull/8371 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add task-level active-process-ratio to Streams metrics > -- > > Key: KAFKA-9753 > URL: https://issues.apache.org/jira/browse/KAFKA-9753 > Project: Kafka > Issue Type: New Feature > Components: streams >Reporter: Guozhang Wang >Assignee: Guozhang Wang >Priority: Major > Fix For: 2.6.0 > > > This is described as part of KIP-444 (which is mostly done in 2.4 / 2.5). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9751) Admin `FindCoordinator` call should go to controller instead of ZK
[ https://issues.apache.org/jira/browse/KAFKA-9751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Boyang Chen updated KAFKA-9751: --- Parent: (was: KAFKA-9119) Issue Type: Improvement (was: Sub-task) > Admin `FindCoordinator` call should go to controller instead of ZK > -- > > Key: KAFKA-9751 > URL: https://issues.apache.org/jira/browse/KAFKA-9751 > Project: Kafka > Issue Type: Improvement >Reporter: Boyang Chen >Priority: Major > > In current trunk, we are still going to use ZK for topic creation in the > routing of FindCoordinatorRequest: > val (partition, topicMetadata) = > CoordinatorType.forId(findCoordinatorRequest.data.keyType) match { > case CoordinatorType.GROUP => > val partition = > groupCoordinator.partitionFor(findCoordinatorRequest.data.key) > val metadata = getOrCreateInternalTopic(GROUP_METADATA_TOPIC_NAME, > request.context.listenerName) > (partition, metadata) > Which should be migrated to controller handling -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9751) Admin `FindCoordinator` call should go to controller instead of ZK
[ https://issues.apache.org/jira/browse/KAFKA-9751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Boyang Chen updated KAFKA-9751: --- Parent: KAFKA-9705 Issue Type: Sub-task (was: Improvement) > Admin `FindCoordinator` call should go to controller instead of ZK > -- > > Key: KAFKA-9751 > URL: https://issues.apache.org/jira/browse/KAFKA-9751 > Project: Kafka > Issue Type: Sub-task >Reporter: Boyang Chen >Priority: Major > > In current trunk, we are still going to use ZK for topic creation in the > routing of FindCoordinatorRequest: > val (partition, topicMetadata) = > CoordinatorType.forId(findCoordinatorRequest.data.keyType) match { > case CoordinatorType.GROUP => > val partition = > groupCoordinator.partitionFor(findCoordinatorRequest.data.key) > val metadata = getOrCreateInternalTopic(GROUP_METADATA_TOPIC_NAME, > request.context.listenerName) > (partition, metadata) > Which should be migrated to controller handling -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9543) Consumer offset reset after new segment rolling
[ https://issues.apache.org/jira/browse/KAFKA-9543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076676#comment-17076676 ] Alexis Seigneurin commented on KAFKA-9543: -- We have been seeing the same issue here with Kafka 2.4.0. Here is an example: at 2020-04-05 06:54:15.746, the offsets recorded on the consumers were between 623860252 and 623869089, and we received a record with offset 405637478. These offsets and times coincide with logs seen on the brokers (see attached screenshot). !image-2020-04-06-17-10-32-636.png! > Consumer offset reset after new segment rolling > --- > > Key: KAFKA-9543 > URL: https://issues.apache.org/jira/browse/KAFKA-9543 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Rafał Boniecki >Priority: Major > Attachments: Untitled.png, image-2020-04-06-17-10-32-636.png > > > After upgrade from kafka 2.1.1 to 2.4.0, I'm experiencing unexpected consumer > offset resets. > Consumer: > {code:java} > 2020-02-12T11:12:58.402+01:00 hostname 4a2a39a35a02 > [2020-02-12T11:12:58,402][INFO > ][org.apache.kafka.clients.consumer.internals.Fetcher] [Consumer > clientId=logstash-1, groupId=logstash] Fetch offset 1632750575 is out of > range for partition stats-5, resetting offset > {code} > Broker: > {code:java} > 2020-02-12 11:12:58:400 CET INFO > [data-plane-kafka-request-handler-1][kafka.log.Log] [Log partition=stats-5, > dir=/kafka4/data] Rolled new log segment at offset 1632750565 in 2 ms.{code} > All resets are perfectly correlated to rolling new segments at the broker - > segment is rolled first, then, couple of ms later, reset on the consumer > occurs. Attached is grafana graph with consumer lag per partition. All sudden > spikes in lag are offset resets due to this bug. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9543) Consumer offset reset after new segment rolling
[ https://issues.apache.org/jira/browse/KAFKA-9543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexis Seigneurin updated KAFKA-9543: - Attachment: image-2020-04-06-17-10-32-636.png > Consumer offset reset after new segment rolling > --- > > Key: KAFKA-9543 > URL: https://issues.apache.org/jira/browse/KAFKA-9543 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Rafał Boniecki >Priority: Major > Attachments: Untitled.png, image-2020-04-06-17-10-32-636.png > > > After upgrade from kafka 2.1.1 to 2.4.0, I'm experiencing unexpected consumer > offset resets. > Consumer: > {code:java} > 2020-02-12T11:12:58.402+01:00 hostname 4a2a39a35a02 > [2020-02-12T11:12:58,402][INFO > ][org.apache.kafka.clients.consumer.internals.Fetcher] [Consumer > clientId=logstash-1, groupId=logstash] Fetch offset 1632750575 is out of > range for partition stats-5, resetting offset > {code} > Broker: > {code:java} > 2020-02-12 11:12:58:400 CET INFO > [data-plane-kafka-request-handler-1][kafka.log.Log] [Log partition=stats-5, > dir=/kafka4/data] Rolled new log segment at offset 1632750565 in 2 ms.{code} > All resets are perfectly correlated to rolling new segments at the broker - > segment is rolled first, then, couple of ms later, reset on the consumer > occurs. Attached is grafana graph with consumer lag per partition. All sudden > spikes in lag are offset resets due to this bug. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9438) Issue with mm2 active/active replication
[ https://issues.apache.org/jira/browse/KAFKA-9438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076672#comment-17076672 ] Rens Groothuijsen commented on KAFKA-9438: -- [~romanius11] Tried to replicate this issue, though the only scenario in which I managed to do so was one where the MM instance accidentally looped back to the source cluster rather than writing to the target cluster. > Issue with mm2 active/active replication > > > Key: KAFKA-9438 > URL: https://issues.apache.org/jira/browse/KAFKA-9438 > Project: Kafka > Issue Type: Bug > Components: mirrormaker >Affects Versions: 2.4.0 >Reporter: Roman >Priority: Minor > > Hi, > > I am trying to configure active/active replication with the new Kafka 2.4.0 and MM2. > I have 2 Kafka clusters, one in (let's say) BO and one in IN. > In each cluster there are 3 brokers. > Topics are replicated properly, so in BO I see > {quote}topics > in.topics > {quote} > > and in IN I see > {quote}topics > bo.topic > {quote} > > That matches the documentation. > > But when I stop the replication process in one data center and start it up again, > the replication replicates the topics with the prefix applied twice (bo.bo.topics > or in.in.topics), depending on which connector I restart. > I have also blacklisted the topics but they are still replicating. > > bo.properties file > {quote}name = in-bo > #topics = .* > topics.blacklist = "bo.*" > #groups = .* > connector.class = org.apache.kafka.connect.mirror.MirrorSourceConnector > tasks.max = 10 > source.cluster.alias = in > target.cluster.alias = bo > source.cluster.bootstrap.servers = > IN-kafka1:9092,IN-kafka2:9092,IN-kafka3:9092 > target.cluster.bootstrap.servers = > BO-kafka1:9092,BO-kafka2:9092,bo-kafka3:9092 > use ByteArrayConverter to ensure that records are not re-encoded > key.converter = org.apache.kafka.connect.converters.ByteArrayConverter > value.converter = org.apache.kafka.connect.converters.ByteArrayConverter > > {quote} > in.properties > {quote}name = bo-in > #topics = .* > topics.blacklist = "in.*" > #groups = .* > connector.class = org.apache.kafka.connect.mirror.MirrorSourceConnector > tasks.max = 10 > source.cluster.alias = bo > target.cluster.alias = in > target.cluster.bootstrap.servers = > IN-kafka1:9092,IN-kafka2:9092,IN-kafka3:9092 > source.cluster.bootstrap.servers = > BO-kafka1:9092,BO-kafka2:9092,BO-kafka3:9092 > use ByteArrayConverter to ensure that records are not re-encoded > key.converter = org.apache.kafka.connect.converters.ByteArrayConverter > value.converter = org.apache.kafka.connect.converters.ByteArrayConverter > {quote} > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-9825) Kafka protocol BNF format should have some way to display tagged fields
Colin McCabe created KAFKA-9825: --- Summary: Kafka protocol BNF format should have some way to display tagged fields Key: KAFKA-9825 URL: https://issues.apache.org/jira/browse/KAFKA-9825 Project: Kafka Issue Type: Improvement Reporter: Colin McCabe The Kafka protocol BNF format should have some way to display tagged fields. But in clients/src/main/java/org/apache/kafka/common/protocol/Protocol.java, there is no special treatment for fields with a tag. Maybe something like FIELD_NAME<1> (where 1 = the tag number) would work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9825) Kafka protocol BNF format should have some way to display tagged fields
[ https://issues.apache.org/jira/browse/KAFKA-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin McCabe updated KAFKA-9825: Labels: newbie (was: ) > Kafka protocol BNF format should have some way to display tagged fields > --- > > Key: KAFKA-9825 > URL: https://issues.apache.org/jira/browse/KAFKA-9825 > Project: Kafka > Issue Type: Improvement >Reporter: Colin McCabe >Priority: Major > Labels: newbie > > The Kafka protocol BNF format should have some way to display tagged fields. > But in clients/src/main/java/org/apache/kafka/common/protocol/Protocol.java, > there is no special treatment for fields with a tag. Maybe something like > FIELD_NAME<1> (where 1 = the tag number) would work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9770) Caching State Store does not Close Underlying State Store When Exception is Thrown During Flushing
[ https://issues.apache.org/jira/browse/KAFKA-9770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076615#comment-17076615 ] John Roesler commented on KAFKA-9770: - Thanks, [~mjsax] ! > Caching State Store does not Close Underlying State Store When Exception is > Thrown During Flushing > -- > > Key: KAFKA-9770 > URL: https://issues.apache.org/jira/browse/KAFKA-9770 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.0.0 >Reporter: Bruno Cadonna >Assignee: Bruno Cadonna >Priority: Major > Fix For: 2.5.0 > > > When a caching state store is closed it calls its {{flush()}} method. If > {{flush()}} throws an exception the underlying state store is not closed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9770) Caching State Store does not Close Underlying State Store When Exception is Thrown During Flushing
[ https://issues.apache.org/jira/browse/KAFKA-9770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076614#comment-17076614 ] Matthias J. Sax commented on KAFKA-9770: The commit was cherry-picked to 2.5 and there will be a new RC. Updated the "fixed version" to 2.5 and resolved the ticket. > Caching State Store does not Close Underlying State Store When Exception is > Thrown During Flushing > -- > > Key: KAFKA-9770 > URL: https://issues.apache.org/jira/browse/KAFKA-9770 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.0.0 >Reporter: Bruno Cadonna >Assignee: Bruno Cadonna >Priority: Major > Fix For: 2.5.0 > > > When a caching state store is closed it calls its {{flush()}} method. If > {{flush()}} throws an exception the underlying state store is not closed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9770) Caching State Store does not Close Underlying State Store When Exception is Thrown During Flushing
[ https://issues.apache.org/jira/browse/KAFKA-9770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias J. Sax updated KAFKA-9770: --- Fix Version/s: (was: 2.6.0) > Caching State Store does not Close Underlying State Store When Exception is > Thrown During Flushing > -- > > Key: KAFKA-9770 > URL: https://issues.apache.org/jira/browse/KAFKA-9770 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.0.0 >Reporter: Bruno Cadonna >Assignee: Bruno Cadonna >Priority: Major > Fix For: 2.5.0 > > > When a caching state store is closed it calls its {{flush()}} method. If > {{flush()}} throws an exception the underlying state store is not closed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9798) org.apache.kafka.streams.integration.QueryableStateIntegrationTest.shouldAllowConcurrentAccesses
[ https://issues.apache.org/jira/browse/KAFKA-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076606#comment-17076606 ] Matthias J. Sax commented on KAFKA-9798: [https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/1624/testReport/junit/org.apache.kafka.streams.integration/QueryableStateIntegrationTest/shouldAllowConcurrentAccesses/] Different error message (seems the hotfix did not help? The PR should contain the hotfix \cc [~guozhang]): {quote}java.nio.file.DirectoryNotEmptyException: /tmp/state-queryable-state-18623682413158663595/queryable-state-1/1_0 at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:242) at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103) at java.nio.file.Files.delete(Files.java:1126) at org.apache.kafka.common.utils.Utils$2.postVisitDirectory(Utils.java:802) at org.apache.kafka.common.utils.Utils$2.postVisitDirectory(Utils.java:772) at java.nio.file.Files.walkFileTree(Files.java:2688) at java.nio.file.Files.walkFileTree(Files.java:2742) at org.apache.kafka.common.utils.Utils.delete(Utils.java:772) at org.apache.kafka.common.utils.Utils.delete(Utils.java:758) at org.apache.kafka.streams.integration.utils.IntegrationTestUtils.purgeLocalStreamsState(IntegrationTestUtils.java:125) at org.apache.kafka.streams.integration.QueryableStateIntegrationTest.shutdown(QueryableStateIntegrationTest.java:225){quote} > org.apache.kafka.streams.integration.QueryableStateIntegrationTest.shouldAllowConcurrentAccesses > > > Key: KAFKA-9798 > URL: https://issues.apache.org/jira/browse/KAFKA-9798 > Project: Kafka > Issue Type: Test > Components: streams, unit tests >Reporter: Bill Bejeck >Priority: Major > Labels: flaky-test, test > -- This message was sent by Atlassian Jira (v8.3.4#803005)
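For context on the stack trace: Utils.delete walks the state directory with Files.walkFileTree and deletes each directory in postVisitDirectory, so a DirectoryNotEmptyException there usually means something (for example a StreamThread that had not fully shut down) created new files in the directory while it was being purged. Below is a minimal sketch of that delete pattern, not the actual Utils.delete implementation.
{code:java}
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public class RecursiveDelete {
    public static void deleteRecursively(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                if (exc != null) {
                    throw exc;
                }
                // throws DirectoryNotEmptyException if new files appeared after visitFile ran
                Files.delete(dir);
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
{code}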
[jira] [Commented] (KAFKA-9697) ControlPlaneNetworkProcessorAvgIdlePercent is always NaN
[ https://issues.apache.org/jira/browse/KAFKA-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076538#comment-17076538 ] Chia-Ping Tsai commented on KAFKA-9697: --- Did you configure control.plane.listener.name? If not, the value is always NaN. > ControlPlaneNetworkProcessorAvgIdlePercent is always NaN > > > Key: KAFKA-9697 > URL: https://issues.apache.org/jira/browse/KAFKA-9697 > Project: Kafka > Issue Type: Improvement > Components: metrics >Affects Versions: 2.3.0 >Reporter: James Cheng >Priority: Major > > I have a broker running Kafka 2.3.0. The value of > kafka.network:type=SocketServer,name=ControlPlaneNetworkProcessorAvgIdlePercent > is always "NaN". > Is that normal, or is there a problem with the metric? > I am running Kafka 2.3.0. I have not checked this in newer/older versions. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
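For anyone else hitting this: the control-plane network processors, and therefore this metric, only exist once a dedicated control plane listener is configured. A minimal, illustrative server.properties fragment is below; the listener name, host, and ports are placeholders, not recommendations.
{code}
# Illustrative only. Without control.plane.listener.name the control-plane
# socket server is never created, so the metric stays NaN.
listeners=CONTROLLER://broker1:9091,PLAINTEXT://broker1:9092
advertised.listeners=CONTROLLER://broker1:9091,PLAINTEXT://broker1:9092
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
control.plane.listener.name=CONTROLLER
{code}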
[jira] [Commented] (KAFKA-9824) Consumer loses partition offset and resets post 2.4.1 version upgrade
[ https://issues.apache.org/jira/browse/KAFKA-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076516#comment-17076516 ] Jason Gustafson commented on KAFKA-9824: [~nitay.k] Thanks for the report. From the logs, we saw that a segment was rolled at offset 1222791065, which is just after the out of range error at offset 1222791071 (if we trust timestamps). It would be useful to know if there was a leader change around this time. The messages about the coordinator unloading suggests that there might have been. Could you search for these two messages in the logs for partition trackingSolutionAttribution-48? {code} info(s"$topicPartition starts at Leader Epoch ${partitionStateInfo.basePartitionState.leaderEpoch} from " + s"offset $leaderEpochStartOffset. Previous Leader Epoch was: $leaderEpoch") {code} and {code} info(s"Truncating to offset $targetOffset") {code} > Consumer loses partition offset and resets post 2.4.1 version upgrade > - > > Key: KAFKA-9824 > URL: https://issues.apache.org/jira/browse/KAFKA-9824 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.4.1 >Reporter: Nitay Kufert >Priority: Major > Attachments: image-2020-04-06-13-14-47-014.png > > > Hello, > around 2 weeks ago we upgraded our Kafka clients & brokers to 2.4.1 (from > 2.3.1), > and we started noticing a troubling behavior that we didn't see before: > > Without apparent reason, a specific partition on a specific consumer loses > its offset and start re-consuming the entire partition from the beginning > (according to the retention). > > Messages appearing on the consumer (client): > {quote}Apr 5, 2020 @ 14:54:47.327 INFO sonic-fire-attribution [Consumer > clientId=consumer-fireAttributionConsumerGroup4-2, > groupId=fireAttributionConsumerGroup4] Resetting offset for partition > trackingSolutionAttribution-48 to offset 1216430527. > {quote} > {quote}Apr 5, 2020 @ 14:54:46.797 INFO sonic-fire-attribution [Consumer > clientId=consumer-fireAttributionConsumerGroup4-2, > groupId=fireAttributionConsumerGroup4] Fetch offset 1222791071 is out of > range for partition trackingSolutionAttribution-48 > {quote} > Those are the logs from the brokers at the same time (searched for > "trackingSolutionAttribution-48" OR "fireAttributionConsumerGroup4") > {quote}Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset > 1222791065 > > Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset > 1222791065 > > Apr 5, 2020 @ 14:54:46.801 INFO Rolled new log segment at offset 1222791065 > in 0 ms. > > Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for > the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive > groups > > Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for > the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive > groups > {quote} > Another way to see the same thing, from our monitoring (DD) on the partition > offset: > !image-2020-04-06-13-14-47-014.png|width=530,height=152! > The recovery you are seeing is after I run partition offset reset manually > (using kafka-consumer-groups.sh --bootstrap-server localhost:9092 --topic > trackingSolutionAttribution:57 --group fireAttributionConsumerGroup4 > --reset-offsets --to-datetime 'SOME DATE') > > Any idea what can be causing this? we have it happen to us at least 5 times > since the upgrade, and before that, I don't remember it ever happening to us. 
> > Topic config is set to default, except the retention, which is manually set > to 4320. > The topic has 60 partitions & a replication factor of 2. > > Consumer config: > {code:java} > ConsumerConfig values: > allow.auto.create.topics = true > auto.commit.interval.ms = 5000 > auto.offset.reset = earliest > bootstrap.servers = [..] > check.crcs = true > client.dns.lookup = default > client.id = > client.rack = > connections.max.idle.ms = 54 > default.api.timeout.ms = 6 > enable.auto.commit = true > exclude.internal.topics = true > fetch.max.bytes = 52428800 > fetch.max.wait.ms = 500 > fetch.min.bytes = 1 > group.id = fireAttributionConsumerGroup4 > group.instance.id = null > heartbeat.interval.ms = 1 > interceptor.classes = [] > internal.leave.group.on.close = true > isolation.level = read_uncommitted > key.deserializer = class > org.apache.kafka.common.serialization.ByteArrayDeserializer > max.partition.fetch.bytes = 1048576 > max.poll.interval.ms = 30 > max.poll.records = 500 > metadata.max.age.ms = 30 >
[jira] [Commented] (KAFKA-9824) Consumer loses partition offset and resets post 2.4.1 version upgrade
[ https://issues.apache.org/jira/browse/KAFKA-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076491#comment-17076491 ] Boyang Chen commented on KAFKA-9824: Thanks for reporting. Does the lost partition happen after each rebalance? > Consumer loses partition offset and resets post 2.4.1 version upgrade > - > > Key: KAFKA-9824 > URL: https://issues.apache.org/jira/browse/KAFKA-9824 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.4.1 >Reporter: Nitay Kufert >Priority: Major > Attachments: image-2020-04-06-13-14-47-014.png > > > Hello, > around 2 weeks ago we upgraded our Kafka clients & brokers to 2.4.1 (from > 2.3.1), > and we started noticing a troubling behavior that we didn't see before: > > Without apparent reason, a specific partition on a specific consumer loses > its offset and start re-consuming the entire partition from the beginning > (according to the retention). > > Messages appearing on the consumer (client): > {quote}Apr 5, 2020 @ 14:54:47.327 INFO sonic-fire-attribution [Consumer > clientId=consumer-fireAttributionConsumerGroup4-2, > groupId=fireAttributionConsumerGroup4] Resetting offset for partition > trackingSolutionAttribution-48 to offset 1216430527. > {quote} > {quote}Apr 5, 2020 @ 14:54:46.797 INFO sonic-fire-attribution [Consumer > clientId=consumer-fireAttributionConsumerGroup4-2, > groupId=fireAttributionConsumerGroup4] Fetch offset 1222791071 is out of > range for partition trackingSolutionAttribution-48 > {quote} > Those are the logs from the brokers at the same time (searched for > "trackingSolutionAttribution-48" OR "fireAttributionConsumerGroup4") > {quote}Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset > 1222791065 > > Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset > 1222791065 > > Apr 5, 2020 @ 14:54:46.801 INFO Rolled new log segment at offset 1222791065 > in 0 ms. > > Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for > the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive > groups > > Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for > the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive > groups > {quote} > Another way to see the same thing, from our monitoring (DD) on the partition > offset: > !image-2020-04-06-13-14-47-014.png|width=530,height=152! > The recovery you are seeing is after I run partition offset reset manually > (using kafka-consumer-groups.sh --bootstrap-server localhost:9092 --topic > trackingSolutionAttribution:57 --group fireAttributionConsumerGroup4 > --reset-offsets --to-datetime 'SOME DATE') > > Any idea what can be causing this? we have it happen to us at least 5 times > since the upgrade, and before that, I don't remember it ever happening to us. > > Topic config is set to default, except the retention, which is manually set > to 4320. > The topic has 60 partitions & a replication factor of 2. > > Consumer config: > {code:java} > ConsumerConfig values: > allow.auto.create.topics = true > auto.commit.interval.ms = 5000 > auto.offset.reset = earliest > bootstrap.servers = [..] 
> check.crcs = true > client.dns.lookup = default > client.id = > client.rack = > connections.max.idle.ms = 54 > default.api.timeout.ms = 6 > enable.auto.commit = true > exclude.internal.topics = true > fetch.max.bytes = 52428800 > fetch.max.wait.ms = 500 > fetch.min.bytes = 1 > group.id = fireAttributionConsumerGroup4 > group.instance.id = null > heartbeat.interval.ms = 1 > interceptor.classes = [] > internal.leave.group.on.close = true > isolation.level = read_uncommitted > key.deserializer = class > org.apache.kafka.common.serialization.ByteArrayDeserializer > max.partition.fetch.bytes = 1048576 > max.poll.interval.ms = 30 > max.poll.records = 500 > metadata.max.age.ms = 30 > metric.reporters = [] > metrics.num.samples = 2 > metrics.recording.level = INFO > metrics.sample.window.ms = 3 > partition.assignment.strategy = [class > org.apache.kafka.clients.consumer.RangeAssignor] > receive.buffer.bytes = 65536 > reconnect.backoff.max.ms = 1000 > reconnect.backoff.ms = 50 > request.timeout.ms = 3 > retry.backoff.ms = 100 > sasl.client.callback.handler.class = null > sasl.jaas.config = null > sasl.kerberos.kinit.cmd = /usr/bin/kinit > sasl.kerberos.min.time.before.relogin = 6 > sasl.kerberos.service.name = null > sasl.kerberos.ticket.renew.jitter =
[jira] [Commented] (KAFKA-3824) Docs indicate auto.commit breaks at least once delivery but that is incorrect
[ https://issues.apache.org/jira/browse/KAFKA-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076359#comment-17076359 ] ASF GitHub Bot commented on KAFKA-3824: --- amanullah92 commented on pull request #8194: KAFKA-3824 | MINOR | Trying to make autocommit behavior clear in the config s… URL: https://github.com/apache/kafka/pull/8194 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Docs indicate auto.commit breaks at least once delivery but that is incorrect > - > > Key: KAFKA-3824 > URL: https://issues.apache.org/jira/browse/KAFKA-3824 > Project: Kafka > Issue Type: Improvement > Components: consumer >Affects Versions: 0.10.0.0 >Reporter: Jay Kreps >Assignee: Jason Gustafson >Priority: Major > Labels: newbie > Fix For: 0.10.1.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > The javadocs for the new consumer indicate that auto commit breaks at least > once delivery. This is no longer correct as of 0.10. > http://kafka.apache.org/0100/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html -- This message was sent by Atlassian Jira (v8.3.4#803005)
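For context on why auto commit no longer breaks at-least-once: the background commit performed inside poll() (and on close()) only covers offsets of records that earlier poll() calls already returned to the application. A minimal, illustrative consumer loop is below; the bootstrap server, group id, and topic name are placeholders. As long as each batch is fully processed before the next poll(), a crash can cause re-processing but not lost records.
{code:java}
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AutoCommitAtLeastOnce {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                  // placeholder
        props.put("group.id", "example-group");                            // placeholder
        props.put("enable.auto.commit", "true");
        props.put("auto.commit.interval.ms", "5000");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("example-topic")); // placeholder
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // finish processing before the next poll()
                }
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.printf("%s-%d@%d: %s%n", record.topic(), record.partition(), record.offset(), record.value());
    }
}
{code}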
[jira] [Updated] (KAFKA-9824) Consumer loses partition offset and resets post 2.4.1 version upgrade
[ https://issues.apache.org/jira/browse/KAFKA-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nitay Kufert updated KAFKA-9824: Description: Hello, around 2 weeks ago we upgraded our Kafka clients & brokers to 2.4.1 (from 2.3.1), and we started noticing a troubling behavior that we didn't see before: Without apparent reason, a specific partition on a specific consumer loses its offset and start re-consuming the entire partition from the beginning (according to the retention). Messages appearing on the consumer (client): {quote}Apr 5, 2020 @ 14:54:47.327 INFO sonic-fire-attribution [Consumer clientId=consumer-fireAttributionConsumerGroup4-2, groupId=fireAttributionConsumerGroup4] Resetting offset for partition trackingSolutionAttribution-48 to offset 1216430527. {quote} {quote}Apr 5, 2020 @ 14:54:46.797 INFO sonic-fire-attribution [Consumer clientId=consumer-fireAttributionConsumerGroup4-2, groupId=fireAttributionConsumerGroup4] Fetch offset 1222791071 is out of range for partition trackingSolutionAttribution-48 {quote} Those are the logs from the brokers at the same time (searched for "trackingSolutionAttribution-48" OR "fireAttributionConsumerGroup4") {quote}Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 1222791065 Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 1222791065 Apr 5, 2020 @ 14:54:46.801 INFO Rolled new log segment at offset 1222791065 in 0 ms. Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive groups Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive groups {quote} Another way to see the same thing, from our monitoring (DD) on the partition offset: !image-2020-04-06-13-14-47-014.png|width=530,height=152! The recovery you are seeing is after I run partition offset reset manually (using kafka-consumer-groups.sh --bootstrap-server localhost:9092 --topic trackingSolutionAttribution:57 --group fireAttributionConsumerGroup4 --reset-offsets --to-datetime 'SOME DATE') Any idea what can be causing this? we have it happen to us at least 5 times since the upgrade, and before that, I don't remember it ever happening to us. Topic config is set to default, except the retention, which is manually set to 4320. The topic has 60 partitions & a replication factor of 2. Consumer config: {code:java} ConsumerConfig values: allow.auto.create.topics = true auto.commit.interval.ms = 5000 auto.offset.reset = earliest bootstrap.servers = [..] 
check.crcs = true client.dns.lookup = default client.id = client.rack = connections.max.idle.ms = 54 default.api.timeout.ms = 6 enable.auto.commit = true exclude.internal.topics = true fetch.max.bytes = 52428800 fetch.max.wait.ms = 500 fetch.min.bytes = 1 group.id = fireAttributionConsumerGroup4 group.instance.id = null heartbeat.interval.ms = 1 interceptor.classes = [] internal.leave.group.on.close = true isolation.level = read_uncommitted key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer max.partition.fetch.bytes = 1048576 max.poll.interval.ms = 30 max.poll.records = 500 metadata.max.age.ms = 30 metric.reporters = [] metrics.num.samples = 2 metrics.recording.level = INFO metrics.sample.window.ms = 3 partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor] receive.buffer.bytes = 65536 reconnect.backoff.max.ms = 1000 reconnect.backoff.ms = 50 request.timeout.ms = 3 retry.backoff.ms = 100 sasl.client.callback.handler.class = null sasl.jaas.config = null sasl.kerberos.kinit.cmd = /usr/bin/kinit sasl.kerberos.min.time.before.relogin = 6 sasl.kerberos.service.name = null sasl.kerberos.ticket.renew.jitter = 0.05 sasl.kerberos.ticket.renew.window.factor = 0.8 sasl.login.callback.handler.class = null sasl.login.class = null sasl.login.refresh.buffer.seconds = 300 sasl.login.refresh.min.period.seconds = 60 sasl.login.refresh.window.factor = 0.8 sasl.login.refresh.window.jitter = 0.05 sasl.mechanism = GSSAPI security.protocol = PLAINTEXT security.providers = null send.buffer.bytes = 131072 session.timeout.ms = 3 ssl.cipher.suites = null ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1] ssl.endpoint.identification.algorithm = https
[jira] [Updated] (KAFKA-9824) Consumer loses partition offset and resets post 2.4.1 version upgrade
[ https://issues.apache.org/jira/browse/KAFKA-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nitay Kufert updated KAFKA-9824: Attachment: image-2020-04-06-13-14-47-014.png > Consumer loses partition offset and resets post 2.4.1 version upgrade > - > > Key: KAFKA-9824 > URL: https://issues.apache.org/jira/browse/KAFKA-9824 > Project: Kafka > Issue Type: Bug >Affects Versions: 2.4.1 >Reporter: Nitay Kufert >Priority: Major > Attachments: image-2020-04-06-13-14-47-014.png > > > Hello, > around 2 weeks ago we upgraded our Kafka clients & brokers to 2.4.1 (from > 2.3.1), > and we started noticing a troubling behavior that we didn't see before: > > Without apparent reason, a specific partition on a specific consumer loses > its offset and start re-consuming the entire partition from the beginning > (according to the retention). > > Messages appearing on the consumer (client): > {quote}Apr 5, 2020 @ 14:54:47.327 INFO sonic-fire-attribution [Consumer > clientId=consumer-fireAttributionConsumerGroup4-2, > groupId=fireAttributionConsumerGroup4] Resetting offset for partition > trackingSolutionAttribution-48 to offset 1216430527.{quote} > {quote}Apr 5, 2020 @ 14:54:46.797 INFO sonic-fire-attribution [Consumer > clientId=consumer-fireAttributionConsumerGroup4-2, > groupId=fireAttributionConsumerGroup4] Fetch offset 1222791071 is out of > range for partition trackingSolutionAttribution-48{quote} > Those are the logs from the brokers at the same time (searched for > "trackingSolutionAttribution-48" OR "fireAttributionConsumerGroup4") > {quote}Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset > 1222791065 > > Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 1222791065 > > Apr 5, 2020 @ 14:54:46.801 INFO Rolled new log segment at offset 1222791065 > in 0 ms. > > Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for > the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive > groups > > Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for > the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive > groups{quote} > Another way to see the same thing, from our monitoring (DD) on the partition > offset: > !https://mail.google.com/mail/u/0?ui=2=8f9b1ec48a=0.1=msg-a:r-1797075611836356806=1714b5531deb08d9=fimg=s0-l75-ft=ANGjdJ_EnC23byd8TemOhOsmVTdpfogSBTeh45zFq4EVB1OvXnSZeLO0yepieyKAm8OoIarKz6qGYuKh9Pp2Ck7CmUFvZj4LljcYfzbmsdMF3LYaN93F6aUIH0l1bMA=emb=ii_k8n355r50|width=542,height=158! > The recovery you are seeing is after I run partition offset reset manually > (using kafka-consumer-groups.sh --bootstrap-server localhost:9092 --topic > trackingSolutionAttribution:57 --group fireAttributionConsumerGroup4 > --reset-offsets --to-datetime 'SOME DATE') > > Any idea what can be causing this? we have it happen to us at least 5 times > since the upgrade, and before that, I don't remember it ever happening to us. > > Topic config is set to default, except the retention, which is manually set > to 4320. > The topic has 60 partitions & a replication factor of 2. > > Consumer config: > {code:java} > ConsumerConfig values: > allow.auto.create.topics = true > auto.commit.interval.ms = 5000 > auto.offset.reset = earliest > bootstrap.servers = [..] 
> check.crcs = true > client.dns.lookup = default > client.id = > client.rack = > connections.max.idle.ms = 54 > default.api.timeout.ms = 6 > enable.auto.commit = true > exclude.internal.topics = true > fetch.max.bytes = 52428800 > fetch.max.wait.ms = 500 > fetch.min.bytes = 1 > group.id = fireAttributionConsumerGroup4 > group.instance.id = null > heartbeat.interval.ms = 1 > interceptor.classes = [] > internal.leave.group.on.close = true > isolation.level = read_uncommitted > key.deserializer = class > org.apache.kafka.common.serialization.ByteArrayDeserializer > max.partition.fetch.bytes = 1048576 > max.poll.interval.ms = 30 > max.poll.records = 500 > metadata.max.age.ms = 30 > metric.reporters = [] > metrics.num.samples = 2 > metrics.recording.level = INFO > metrics.sample.window.ms = 3 > partition.assignment.strategy = [class > org.apache.kafka.clients.consumer.RangeAssignor] > receive.buffer.bytes = 65536 > reconnect.backoff.max.ms = 1000 > reconnect.backoff.ms = 50 > request.timeout.ms = 3 > retry.backoff.ms = 100 > sasl.client.callback.handler.class = null > sasl.jaas.config = null > sasl.kerberos.kinit.cmd = /usr/bin/kinit >
[jira] [Updated] (KAFKA-9824) Consumer loses partition offset and resets post 2.4.1 version upgrade
[ https://issues.apache.org/jira/browse/KAFKA-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nitay Kufert updated KAFKA-9824: Description: Hello, around 2 weeks ago we upgraded our Kafka clients & brokers to 2.4.1 (from 2.3.1), and we started noticing a troubling behavior that we didn't see before: Without apparent reason, a specific partition on a specific consumer loses its offset and start re-consuming the entire partition from the beginning (according to the retention). Messages appearing on the consumer (client): {quote}Apr 5, 2020 @ 14:54:47.327 INFO sonic-fire-attribution [Consumer clientId=consumer-fireAttributionConsumerGroup4-2, groupId=fireAttributionConsumerGroup4] Resetting offset for partition trackingSolutionAttribution-48 to offset 1216430527. {quote} {quote}Apr 5, 2020 @ 14:54:46.797 INFO sonic-fire-attribution [Consumer clientId=consumer-fireAttributionConsumerGroup4-2, groupId=fireAttributionConsumerGroup4] Fetch offset 1222791071 is out of range for partition trackingSolutionAttribution-48 {quote} Those are the logs from the brokers at the same time (searched for "trackingSolutionAttribution-48" OR "fireAttributionConsumerGroup4") {quote}Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 1222791065 Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 1222791065 Apr 5, 2020 @ 14:54:46.801 INFO Rolled new log segment at offset 1222791065 in 0 ms. Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive groups Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive groups {quote} Another way to see the same thing, from our monitoring (DD) on the partition offset: The recovery you !image-2020-04-06-13-14-47-014.png! are seeing is after I run partition offset reset manually (using kafka-consumer-groups.sh --bootstrap-server localhost:9092 --topic trackingSolutionAttribution:57 --group fireAttributionConsumerGroup4 --reset-offsets --to-datetime 'SOME DATE') Any idea what can be causing this? we have it happen to us at least 5 times since the upgrade, and before that, I don't remember it ever happening to us. Topic config is set to default, except the retention, which is manually set to 4320. The topic has 60 partitions & a replication factor of 2. Consumer config: {code:java} ConsumerConfig values: allow.auto.create.topics = true auto.commit.interval.ms = 5000 auto.offset.reset = earliest bootstrap.servers = [..] 
check.crcs = true client.dns.lookup = default client.id = client.rack = connections.max.idle.ms = 54 default.api.timeout.ms = 6 enable.auto.commit = true exclude.internal.topics = true fetch.max.bytes = 52428800 fetch.max.wait.ms = 500 fetch.min.bytes = 1 group.id = fireAttributionConsumerGroup4 group.instance.id = null heartbeat.interval.ms = 1 interceptor.classes = [] internal.leave.group.on.close = true isolation.level = read_uncommitted key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer max.partition.fetch.bytes = 1048576 max.poll.interval.ms = 30 max.poll.records = 500 metadata.max.age.ms = 30 metric.reporters = [] metrics.num.samples = 2 metrics.recording.level = INFO metrics.sample.window.ms = 3 partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor] receive.buffer.bytes = 65536 reconnect.backoff.max.ms = 1000 reconnect.backoff.ms = 50 request.timeout.ms = 3 retry.backoff.ms = 100 sasl.client.callback.handler.class = null sasl.jaas.config = null sasl.kerberos.kinit.cmd = /usr/bin/kinit sasl.kerberos.min.time.before.relogin = 6 sasl.kerberos.service.name = null sasl.kerberos.ticket.renew.jitter = 0.05 sasl.kerberos.ticket.renew.window.factor = 0.8 sasl.login.callback.handler.class = null sasl.login.class = null sasl.login.refresh.buffer.seconds = 300 sasl.login.refresh.min.period.seconds = 60 sasl.login.refresh.window.factor = 0.8 sasl.login.refresh.window.jitter = 0.05 sasl.mechanism = GSSAPI security.protocol = PLAINTEXT security.providers = null send.buffer.bytes = 131072 session.timeout.ms = 3 ssl.cipher.suites = null ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1] ssl.endpoint.identification.algorithm = https ssl.key.password = null
[jira] [Created] (KAFKA-9824) Consumer loses partition offset and resets post 2.4.1 version upgrade
Nitay Kufert created KAFKA-9824: --- Summary: Consumer loses partition offset and resets post 2.4.1 version upgrade Key: KAFKA-9824 URL: https://issues.apache.org/jira/browse/KAFKA-9824 Project: Kafka Issue Type: Bug Affects Versions: 2.4.1 Reporter: Nitay Kufert Hello, around 2 weeks ago we upgraded our Kafka clients & brokers to 2.4.1 (from 2.3.1), and we started noticing a troubling behavior that we didn't see before: Without apparent reason, a specific partition on a specific consumer loses its offset and start re-consuming the entire partition from the beginning (according to the retention). Messages appearing on the consumer (client): {quote}Apr 5, 2020 @ 14:54:47.327 INFO sonic-fire-attribution [Consumer clientId=consumer-fireAttributionConsumerGroup4-2, groupId=fireAttributionConsumerGroup4] Resetting offset for partition trackingSolutionAttribution-48 to offset 1216430527.{quote} {quote}Apr 5, 2020 @ 14:54:46.797 INFO sonic-fire-attribution [Consumer clientId=consumer-fireAttributionConsumerGroup4-2, groupId=fireAttributionConsumerGroup4] Fetch offset 1222791071 is out of range for partition trackingSolutionAttribution-48{quote} Those are the logs from the brokers at the same time (searched for "trackingSolutionAttribution-48" OR "fireAttributionConsumerGroup4") {quote}Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 1222791065 Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 1222791065 Apr 5, 2020 @ 14:54:46.801 INFO Rolled new log segment at offset 1222791065 in 0 ms. Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive groups Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive groups{quote} Another way to see the same thing, from our monitoring (DD) on the partition offset: !https://mail.google.com/mail/u/0?ui=2=8f9b1ec48a=0.1=msg-a:r-1797075611836356806=1714b5531deb08d9=fimg=s0-l75-ft=ANGjdJ_EnC23byd8TemOhOsmVTdpfogSBTeh45zFq4EVB1OvXnSZeLO0yepieyKAm8OoIarKz6qGYuKh9Pp2Ck7CmUFvZj4LljcYfzbmsdMF3LYaN93F6aUIH0l1bMA=emb=ii_k8n355r50|width=542,height=158! The recovery you are seeing is after I run partition offset reset manually (using kafka-consumer-groups.sh --bootstrap-server localhost:9092 --topic trackingSolutionAttribution:57 --group fireAttributionConsumerGroup4 --reset-offsets --to-datetime 'SOME DATE') Any idea what can be causing this? we have it happen to us at least 5 times since the upgrade, and before that, I don't remember it ever happening to us. Topic config is set to default, except the retention, which is manually set to 4320. The topic has 60 partitions & a replication factor of 2. Consumer config: {code:java} ConsumerConfig values: allow.auto.create.topics = true auto.commit.interval.ms = 5000 auto.offset.reset = earliest bootstrap.servers = [..] 
check.crcs = true client.dns.lookup = default client.id = client.rack = connections.max.idle.ms = 54 default.api.timeout.ms = 6 enable.auto.commit = true exclude.internal.topics = true fetch.max.bytes = 52428800 fetch.max.wait.ms = 500 fetch.min.bytes = 1 group.id = fireAttributionConsumerGroup4 group.instance.id = null heartbeat.interval.ms = 1 interceptor.classes = [] internal.leave.group.on.close = true isolation.level = read_uncommitted key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer max.partition.fetch.bytes = 1048576 max.poll.interval.ms = 30 max.poll.records = 500 metadata.max.age.ms = 30 metric.reporters = [] metrics.num.samples = 2 metrics.recording.level = INFO metrics.sample.window.ms = 3 partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor] receive.buffer.bytes = 65536 reconnect.backoff.max.ms = 1000 reconnect.backoff.ms = 50 request.timeout.ms = 3 retry.backoff.ms = 100 sasl.client.callback.handler.class = null sasl.jaas.config = null sasl.kerberos.kinit.cmd = /usr/bin/kinit sasl.kerberos.min.time.before.relogin = 6 sasl.kerberos.service.name = null sasl.kerberos.ticket.renew.jitter = 0.05 sasl.kerberos.ticket.renew.window.factor = 0.8 sasl.login.callback.handler.class = null sasl.login.class = null sasl.login.refresh.buffer.seconds = 300 sasl.login.refresh.min.period.seconds = 60 sasl.login.refresh.window.factor = 0.8
[jira] [Commented] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions
[ https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076167#comment-17076167 ] Evan Williams commented on KAFKA-4084: -- [~sql_consulting] Many thanks for that detailed info. Sounds like some great work there, and very handy features. On another somewhat related note (and I can open a bug ticket for this if need be) - I've noticed that on a cluster (5.4) with 6000 topics, manual leader election times out. Is there a way to increase the timeout? If not, then our only option is auto.leader.rebalance.enable=true. I guess it's important that PLE works, for all of this functionality to work properly. *kafka-leader-election --bootstrap-server $(grep advertised.listeners= /etc/kafka/server.properties |cut -d: -f4 |cut -d/ -f3):9092 --all-topic-partitions --election-type preferred* Timeout waiting for election results Exception in thread "main" kafka.common.AdminCommandFailedException: Timeout waiting for election results at kafka.admin.LeaderElectionCommand$.electLeaders(LeaderElectionCommand.scala:133) at kafka.admin.LeaderElectionCommand$.run(LeaderElectionCommand.scala:88) at kafka.admin.LeaderElectionCommand$.main(LeaderElectionCommand.scala:41) at kafka.admin.LeaderElectionCommand.main(LeaderElectionCommand.scala) Caused by: org.apache.kafka.common.errors.TimeoutException: Aborted due to timeout. > automated leader rebalance causes replication downtime for clusters with too > many partitions > > > Key: KAFKA-4084 > URL: https://issues.apache.org/jira/browse/KAFKA-4084 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1 >Reporter: Tom Crayford >Priority: Major > Labels: reliability > Fix For: 1.1.0 > > > If you enable {{auto.leader.rebalance.enable}} (which is on by default), and > you have a cluster with many partitions, there is a severe amount of > replication downtime following a restart. This causes > `UnderReplicatedPartitions` to fire, and replication is paused. > This is because the current automated leader rebalance mechanism changes > leaders for *all* imbalanced partitions at once, instead of doing it > gradually. This effectively stops all replica fetchers in the cluster > (assuming there are enough imbalanced partitions), and restarts them. This > can take minutes on busy clusters, during which no replication is happening > and user data is at risk. Clients with {{acks=-1}} also see issues at this > time, because replication is effectively stalled. > To quote Todd Palino from the mailing list: > bq. There is an admin CLI command to trigger the preferred replica election > manually. There is also a broker configuration “auto.leader.rebalance.enable” > which you can set to have the broker automatically perform the PLE when > needed. DO NOT USE THIS OPTION. There are serious performance issues when > doing so, especially on larger clusters. It needs some development work that > has not been fully identified yet. > This setting is extremely useful for smaller clusters, but with high > partition counts causes the huge issues stated above. > One potential fix could be adding a new configuration for the number of > partitions to do automated leader rebalancing for at once, and *stop* once > that number of leader rebalances are in flight, until they're done. There may > be better mechanisms, and I'd love to hear if anybody has any ideas. -- This message was sent by Atlassian Jira (v8.3.4#803005)
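On the timeout question above: one workaround is to drive the preferred leader election through the Admin client with a larger timeout instead of the CLI (depending on the version, the CLI may also accept an admin-client properties file via --admin.config where request.timeout.ms can be raised). A rough sketch is below; the bootstrap server is a placeholder and the 10-minute timeout is only illustrative.
{code:java}
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ElectLeadersOptions;
import org.apache.kafka.common.ElectionType;

public class PreferredLeaderElection {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        props.put(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, "600000");      // illustrative 10 minutes

        try (AdminClient admin = AdminClient.create(props)) {
            ElectLeadersOptions options = new ElectLeadersOptions().timeoutMs(600_000);
            // a null partition set means all partitions, like the CLI's --all-topic-partitions
            admin.electLeaders(ElectionType.PREFERRED, null, options).partitions().get();
        }
    }
}
{code}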
[jira] [Commented] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions
[ https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076124#comment-17076124 ] GEORGE LI commented on KAFKA-4084: -- [~blodsbror] If this turns out to be positive in the testing, I can restart the discussion on the dev mailing list for KIP-491. At least it works/helps with auto.leader.rebalance.enable=true. There are other use cases listed in KIP-491, e.g. when the controller is busy with metadata requests, this dynamic config can be set for the controller; after running PLE the controller gives up all its leadership and acts just as a follower, with CPU usage down 10-15%, making its work lightweight without needing to bounce the controller. I know some company is working on the feature of separating the controller to another set of machines. Our primary use case of this `leader.deprioritized.list=` feature is bundled together with another feature called replica.start.offset.strategy=latest, which I have not filed a KIP for (default is earliest, like current Kafka behavior); this is also a dynamic config and can be set at the broker level (or cluster-wide). What it does: when a broker fails, loses all its local disk, and is replaced with an empty broker, the empty broker would by default need to start replication from the earliest offset; for us, this could be 20TB+ of data over a few hours and can cause outages if not throttled properly. So, just like the Kafka consumer, we introduce the dynamic config replica.start.offset.strategy=latest to replicate from each partition leader's latest offset. Once the broker is caught up (URP => 0 for this broker), usually in 5-10 minutes or sooner, the dynamic config is removed. Because this broker does not have all the historical data, it should not be serving leaderships. That's where the KIP-491 `leader.deprioritized.list=` comes into play. The automation software will calculate the retention time at the broker and topic level, take the max, and once the broker has been replicating for that amount of time (e.g. 6 hours, 1 day, 3 days, whatever), the automation software will remove the leader.deprioritized.list dynamic config for the broker and run PLE to change the leadership back to it. > automated leader rebalance causes replication downtime for clusters with too > many partitions > > > Key: KAFKA-4084 > URL: https://issues.apache.org/jira/browse/KAFKA-4084 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1 >Reporter: Tom Crayford >Priority: Major > Labels: reliability > Fix For: 1.1.0 > > > If you enable {{auto.leader.rebalance.enable}} (which is on by default), and > you have a cluster with many partitions, there is a severe amount of > replication downtime following a restart. This causes > `UnderReplicatedPartitions` to fire, and replication is paused. > This is because the current automated leader rebalance mechanism changes > leaders for *all* imbalanced partitions at once, instead of doing it > gradually. This effectively stops all replica fetchers in the cluster > (assuming there are enough imbalanced partitions), and restarts them. This > can take minutes on busy clusters, during which no replication is happening > and user data is at risk. Clients with {{acks=-1}} also see issues at this > time, because replication is effectively stalled. > To quote Todd Palino from the mailing list: > bq. There is an admin CLI command to trigger the preferred replica election > manually. 
There is also a broker configuration “auto.leader.rebalance.enable” > which you can set to have the broker automatically perform the PLE when > needed. DO NOT USE THIS OPTION. There are serious performance issues when > doing so, especially on larger clusters. It needs some development work that > has not been fully identified yet. > This setting is extremely useful for smaller clusters, but with high > partition counts causes the huge issues stated above. > One potential fix could be adding a new configuration for the number of > partitions to do automated leader rebalancing for at once, and *stop* once > that number of leader rebalances are in flight, until they're done. There may > be better mechanisms, and I'd love to hear if anybody has any ideas. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions
[ https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076092#comment-17076092 ] Evan Williams edited comment on KAFKA-4084 at 4/6/20, 7:35 AM: --- [~sql_consulting] Thanks. Yes, I already have that logic scripted up which is good. However for me, it's much better to have auto.leader.rebalance.enable=true cluster wide if possible - so that it can take care of instance reboots, or service (planned/unplanned) restarts, automatically. And that's the real benefit of this patch. Being able to leave that enabled, yet not killing a replaced broker with traffic until UR=0 :) If this turns out to work well - what are the chances that it get's merged into an official release ? was (Author: blodsbror): [~sql_consulting] Thanks. Yes, I already have that logic scripted up which is good. However for me, it's much better to have auto.leader.rebalance.enable=true cluster wide if possible - so that it can take care of instance reboots, or service (planned/unplanned) restarts, automatically. And that's the real benefit of this patch. Being able to leave that enabled, yet not killing a replaced broker with traffic until UR=0 :) > automated leader rebalance causes replication downtime for clusters with too > many partitions > > > Key: KAFKA-4084 > URL: https://issues.apache.org/jira/browse/KAFKA-4084 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1 >Reporter: Tom Crayford >Priority: Major > Labels: reliability > Fix For: 1.1.0 > > > If you enable {{auto.leader.rebalance.enable}} (which is on by default), and > you have a cluster with many partitions, there is a severe amount of > replication downtime following a restart. This causes > `UnderReplicatedPartitions` to fire, and replication is paused. > This is because the current automated leader rebalance mechanism changes > leaders for *all* imbalanced partitions at once, instead of doing it > gradually. This effectively stops all replica fetchers in the cluster > (assuming there are enough imbalanced partitions), and restarts them. This > can take minutes on busy clusters, during which no replication is happening > and user data is at risk. Clients with {{acks=-1}} also see issues at this > time, because replication is effectively stalled. > To quote Todd Palino from the mailing list: > bq. There is an admin CLI command to trigger the preferred replica election > manually. There is also a broker configuration “auto.leader.rebalance.enable” > which you can set to have the broker automatically perform the PLE when > needed. DO NOT USE THIS OPTION. There are serious performance issues when > doing so, especially on larger clusters. It needs some development work that > has not been fully identified yet. > This setting is extremely useful for smaller clusters, but with high > partition counts causes the huge issues stated above. > One potential fix could be adding a new configuration for the number of > partitions to do automated leader rebalancing for at once, and *stop* once > that number of leader rebalances are in flight, until they're done. There may > be better mechanisms, and I'd love to hear if anybody has any ideas. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions
[ https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076092#comment-17076092 ] Evan Williams commented on KAFKA-4084: -- [~sql_consulting] Thanks. Yes, I already have that logic scripted up which is good. However for me, it's much better to have auto.leader.rebalance.enable=true cluster wide if possible - so that it can take care of instance reboots, or service (planned/unplanned) restarts, automatically. And that's the real benefit of this patch. Being able to leave that enabled, yet not killing a replaced broker with traffic until UR=0 :) > automated leader rebalance causes replication downtime for clusters with too > many partitions > > > Key: KAFKA-4084 > URL: https://issues.apache.org/jira/browse/KAFKA-4084 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1 >Reporter: Tom Crayford >Priority: Major > Labels: reliability > Fix For: 1.1.0 > > > If you enable {{auto.leader.rebalance.enable}} (which is on by default), and > you have a cluster with many partitions, there is a severe amount of > replication downtime following a restart. This causes > `UnderReplicatedPartitions` to fire, and replication is paused. > This is because the current automated leader rebalance mechanism changes > leaders for *all* imbalanced partitions at once, instead of doing it > gradually. This effectively stops all replica fetchers in the cluster > (assuming there are enough imbalanced partitions), and restarts them. This > can take minutes on busy clusters, during which no replication is happening > and user data is at risk. Clients with {{acks=-1}} also see issues at this > time, because replication is effectively stalled. > To quote Todd Palino from the mailing list: > bq. There is an admin CLI command to trigger the preferred replica election > manually. There is also a broker configuration “auto.leader.rebalance.enable” > which you can set to have the broker automatically perform the PLE when > needed. DO NOT USE THIS OPTION. There are serious performance issues when > doing so, especially on larger clusters. It needs some development work that > has not been fully identified yet. > This setting is extremely useful for smaller clusters, but with high > partition counts causes the huge issues stated above. > One potential fix could be adding a new configuration for the number of > partitions to do automated leader rebalancing for at once, and *stop* once > that number of leader rebalances are in flight, until they're done. There may > be better mechanisms, and I'd love to hear if anybody has any ideas. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions
[ https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076064#comment-17076064 ] Evan Williams edited comment on KAFKA-4084 at 4/6/20, 7:29 AM: --- Much appreciated [~sql_consulting]. I've requested access to the Google doc. I will implement and let you know how things go!. If this turns out to work well - what are the chances that it get's merged into an official release ? Edit: Answered in the doc, thanks! One question though - is it reasonable to still set auto.leader.rebalance.enable=true ?, on all brokers, with this new functionality ? was (Author: blodsbror): Much appreciated [~sql_consulting]. I've requested access to the Google doc. I will implement and let you know how things go!. Edit: Answered in the doc, thanks! One question though - is it reasonable to still set auto.leader.rebalance.enable=true ?, on all brokers, with this new functionality ? > automated leader rebalance causes replication downtime for clusters with too > many partitions > > > Key: KAFKA-4084 > URL: https://issues.apache.org/jira/browse/KAFKA-4084 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1 >Reporter: Tom Crayford >Priority: Major > Labels: reliability > Fix For: 1.1.0 > > > If you enable {{auto.leader.rebalance.enable}} (which is on by default), and > you have a cluster with many partitions, there is a severe amount of > replication downtime following a restart. This causes > `UnderReplicatedPartitions` to fire, and replication is paused. > This is because the current automated leader rebalance mechanism changes > leaders for *all* imbalanced partitions at once, instead of doing it > gradually. This effectively stops all replica fetchers in the cluster > (assuming there are enough imbalanced partitions), and restarts them. This > can take minutes on busy clusters, during which no replication is happening > and user data is at risk. Clients with {{acks=-1}} also see issues at this > time, because replication is effectively stalled. > To quote Todd Palino from the mailing list: > bq. There is an admin CLI command to trigger the preferred replica election > manually. There is also a broker configuration “auto.leader.rebalance.enable” > which you can set to have the broker automatically perform the PLE when > needed. DO NOT USE THIS OPTION. There are serious performance issues when > doing so, especially on larger clusters. It needs some development work that > has not been fully identified yet. > This setting is extremely useful for smaller clusters, but with high > partition counts causes the huge issues stated above. > One potential fix could be adding a new configuration for the number of > partitions to do automated leader rebalancing for at once, and *stop* once > that number of leader rebalances are in flight, until they're done. There may > be better mechanisms, and I'd love to hear if anybody has any ideas. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions
[ https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076089#comment-17076089 ] GEORGE LI commented on KAFKA-4084: -- [~blodsbror] I think it's still ok to set auto.leader.rebalance.enable=true, but for the broker that failed and is replaced, coming up empty, there should be some kind of automation that first sets the `leader.deprioritized.list=` dynamic config at the '' (cluster-global) level, so the current controller, and possibly another controller after a failover, can put it into effect immediately. Then start the new replaced broker. While the broker is busy catching up, because it has lower priority for being considered for leadership, auto.leader.rebalance.enable=true is effectively disabled for this broker. After this broker catches up (e.g. URP => 0, CPU/disk etc. back to normal), the dynamic config above can be removed by the automation script, and with auto.leader.rebalance.enable=true, leadership will automatically move back to the preferred leader (the first/head of the partition assignment), i.e. back to this broker. > automated leader rebalance causes replication downtime for clusters with too > many partitions > > > Key: KAFKA-4084 > URL: https://issues.apache.org/jira/browse/KAFKA-4084 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1 >Reporter: Tom Crayford >Priority: Major > Labels: reliability > Fix For: 1.1.0 > > > If you enable {{auto.leader.rebalance.enable}} (which is on by default), and > you have a cluster with many partitions, there is a severe amount of > replication downtime following a restart. This causes > `UnderReplicatedPartitions` to fire, and replication is paused. > This is because the current automated leader rebalance mechanism changes > leaders for *all* imbalanced partitions at once, instead of doing it > gradually. This effectively stops all replica fetchers in the cluster > (assuming there are enough imbalanced partitions), and restarts them. This > can take minutes on busy clusters, during which no replication is happening > and user data is at risk. Clients with {{acks=-1}} also see issues at this > time, because replication is effectively stalled. > To quote Todd Palino from the mailing list: > bq. There is an admin CLI command to trigger the preferred replica election > manually. There is also a broker configuration “auto.leader.rebalance.enable” > which you can set to have the broker automatically perform the PLE when > needed. DO NOT USE THIS OPTION. There are serious performance issues when > doing so, especially on larger clusters. It needs some development work that > has not been fully identified yet. > This setting is extremely useful for smaller clusters, but with high > partition counts causes the huge issues stated above. > One potential fix could be adding a new configuration for the number of > partitions to do automated leader rebalancing for at once, and *stop* once > that number of leader rebalances are in flight, until they're done. There may > be better mechanisms, and I'd love to hear if anybody has any ideas. -- This message was sent by Atlassian Jira (v8.3.4#803005)
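To make the automation step above concrete, here is a rough sketch of setting and clearing such a dynamic config cluster-wide through the Admin client. Note that `leader.deprioritized.list` is the config proposed by KIP-491 and only exists in the patched build; the broker id and bootstrap server below are placeholders.
{code:java}
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class DeprioritizeBroker {
    // Sets or clears the KIP-491 patch's dynamic config on the cluster-wide default
    // ("" broker entity). The config name only exists in the patched build.
    static void setDeprioritized(AdminClient admin, String brokerIds, boolean add) throws Exception {
        ConfigResource clusterDefault = new ConfigResource(ConfigResource.Type.BROKER, "");
        AlterConfigOp op = add
            ? new AlterConfigOp(new ConfigEntry("leader.deprioritized.list", brokerIds), AlterConfigOp.OpType.SET)
            : new AlterConfigOp(new ConfigEntry("leader.deprioritized.list", ""), AlterConfigOp.OpType.DELETE);
        Map<ConfigResource, Collection<AlterConfigOp>> request =
            Collections.singletonMap(clusterDefault, Collections.singletonList(op));
        admin.incrementalAlterConfigs(request).all().get();
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            setDeprioritized(admin, "1033", true);  // before starting the empty replacement broker
            // ... wait until URP == 0 and the broker has caught up, then:
            setDeprioritized(admin, "1033", false); // allow it to take leadership again
        }
    }
}
{code}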
[jira] [Comment Edited] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions
[ https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076064#comment-17076064 ] Evan Williams edited comment on KAFKA-4084 at 4/6/20, 7:24 AM: --- Much appreciated [~sql_consulting]. I've requested access to the Google doc. I will implement and let you know how things go!. Edit: Answered in the doc, thanks! One question though - is it reasonable to still set auto.leader.rebalance.enable=true ?, on all brokers, with this new functionality ? was (Author: blodsbror): Much appreciated [~sql_consulting]. I've requested access to the Google doc. I will implement and let you know how things go!. One question though - is it reasonable to still set auto.leader.rebalance.enable=true ?, on all brokers, with this new functionality ? > automated leader rebalance causes replication downtime for clusters with too > many partitions > > > Key: KAFKA-4084 > URL: https://issues.apache.org/jira/browse/KAFKA-4084 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1 >Reporter: Tom Crayford >Priority: Major > Labels: reliability > Fix For: 1.1.0 > > > If you enable {{auto.leader.rebalance.enable}} (which is on by default), and > you have a cluster with many partitions, there is a severe amount of > replication downtime following a restart. This causes > `UnderReplicatedPartitions` to fire, and replication is paused. > This is because the current automated leader rebalance mechanism changes > leaders for *all* imbalanced partitions at once, instead of doing it > gradually. This effectively stops all replica fetchers in the cluster > (assuming there are enough imbalanced partitions), and restarts them. This > can take minutes on busy clusters, during which no replication is happening > and user data is at risk. Clients with {{acks=-1}} also see issues at this > time, because replication is effectively stalled. > To quote Todd Palino from the mailing list: > bq. There is an admin CLI command to trigger the preferred replica election > manually. There is also a broker configuration “auto.leader.rebalance.enable” > which you can set to have the broker automatically perform the PLE when > needed. DO NOT USE THIS OPTION. There are serious performance issues when > doing so, especially on larger clusters. It needs some development work that > has not been fully identified yet. > This setting is extremely useful for smaller clusters, but with high > partition counts causes the huge issues stated above. > One potential fix could be adding a new configuration for the number of > partitions to do automated leader rebalancing for at once, and *stop* once > that number of leader rebalances are in flight, until they're done. There may > be better mechanisms, and I'd love to hear if anybody has any ideas. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions
[ https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076006#comment-17076006 ] GEORGE LI edited comment on KAFKA-4084 at 4/6/20, 7:13 AM: --- [~blodsbror] Have some free time this weekend to troubleshoot and found out after 1.1 , at least in 2.4, the controller code has some optimization for PLE, not running PLE at all if current_leader == Head of Replica. That had cause my unit/integration tests to fail. I have patched that as well. I have landed my code change to my repo's feature branch. 2.4-leader-deprioritized-list (based on the 2.4 branch) detail installation and testing steps in this [Google doc|https://docs.google.com/document/d/14vlPkbaog_5Xdd-HB4vMRaQQ7Fq4SlxddULsvc3PlbY/edit]. Please let me know if you have issues with the patch/testing. If can not view the doc, please click the request access button. or send me your email to add to the share. my email: sqlconsult...@gmail.com Please keep us posted with your testing results. Thanks, George was (Author: sql_consulting): [~blodsbror] Have some free time this weekend to troubleshoot and found out after 1.1 , at least in 2.4, the controller code has some optimization for PLE, not running PLE at all if current_leader == Head of Replica. That had cause my unit/integration tests to fail. I have patched that as well. I have landed my code change to my repo's feature branch. 2.4-leader-deprioritized-list (based on the 2.4 branch) detail installation and testing steps in this [Google doc|https://docs.google.com/document/d/1ZuOcYTSuCAqCut_hjI_EY3lA9W7BuIlHVUdOUcSpWww/edit]. Please let me know if you have issues with the patch/testing. If can not view the doc, please click the request access button. or send me your email to add to the share. my email: sqlconsult...@gmail.com Please keep us posted with your testing results. Thanks, George > automated leader rebalance causes replication downtime for clusters with too > many partitions > > > Key: KAFKA-4084 > URL: https://issues.apache.org/jira/browse/KAFKA-4084 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1 >Reporter: Tom Crayford >Priority: Major > Labels: reliability > Fix For: 1.1.0 > > > If you enable {{auto.leader.rebalance.enable}} (which is on by default), and > you have a cluster with many partitions, there is a severe amount of > replication downtime following a restart. This causes > `UnderReplicatedPartitions` to fire, and replication is paused. > This is because the current automated leader rebalance mechanism changes > leaders for *all* imbalanced partitions at once, instead of doing it > gradually. This effectively stops all replica fetchers in the cluster > (assuming there are enough imbalanced partitions), and restarts them. This > can take minutes on busy clusters, during which no replication is happening > and user data is at risk. Clients with {{acks=-1}} also see issues at this > time, because replication is effectively stalled. > To quote Todd Palino from the mailing list: > bq. There is an admin CLI command to trigger the preferred replica election > manually. There is also a broker configuration “auto.leader.rebalance.enable” > which you can set to have the broker automatically perform the PLE when > needed. DO NOT USE THIS OPTION. There are serious performance issues when > doing so, especially on larger clusters. It needs some development work that > has not been fully identified yet. 
> This setting is extremely useful for smaller clusters, but with high > partition counts causes the huge issues stated above. > One potential fix could be adding a new configuration for the number of > partitions to do automated leader rebalancing for at once, and *stop* once > that number of leader rebalances are in flight, until they're done. There may > be better mechanisms, and I'd love to hear if anybody has any ideas. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions
[ https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076064#comment-17076064 ] Evan Williams commented on KAFKA-4084: -- Much appreciated [~sql_consulting]. I've requested access to the Google doc. I will implement and let you know how things go!. One question though - is it reasonable to still set auto.leader.rebalance.enable=true ?, on all brokers, with this new functionality ? > automated leader rebalance causes replication downtime for clusters with too > many partitions > > > Key: KAFKA-4084 > URL: https://issues.apache.org/jira/browse/KAFKA-4084 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1 >Reporter: Tom Crayford >Priority: Major > Labels: reliability > Fix For: 1.1.0 > > > If you enable {{auto.leader.rebalance.enable}} (which is on by default), and > you have a cluster with many partitions, there is a severe amount of > replication downtime following a restart. This causes > `UnderReplicatedPartitions` to fire, and replication is paused. > This is because the current automated leader rebalance mechanism changes > leaders for *all* imbalanced partitions at once, instead of doing it > gradually. This effectively stops all replica fetchers in the cluster > (assuming there are enough imbalanced partitions), and restarts them. This > can take minutes on busy clusters, during which no replication is happening > and user data is at risk. Clients with {{acks=-1}} also see issues at this > time, because replication is effectively stalled. > To quote Todd Palino from the mailing list: > bq. There is an admin CLI command to trigger the preferred replica election > manually. There is also a broker configuration “auto.leader.rebalance.enable” > which you can set to have the broker automatically perform the PLE when > needed. DO NOT USE THIS OPTION. There are serious performance issues when > doing so, especially on larger clusters. It needs some development work that > has not been fully identified yet. > This setting is extremely useful for smaller clusters, but with high > partition counts causes the huge issues stated above. > One potential fix could be adding a new configuration for the number of > partitions to do automated leader rebalancing for at once, and *stop* once > that number of leader rebalances are in flight, until they're done. There may > be better mechanisms, and I'd love to hear if anybody has any ideas. -- This message was sent by Atlassian Jira (v8.3.4#803005)