[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId

2020-04-06 Thread Raman Gupta (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076895#comment-17076895
 ] 

Raman Gupta commented on KAFKA-8803:


We are on 2.4.1 now, locally patched with the fix for KAFKA-9749. So far so 
good.

> Stream will not start due to TimeoutException: Timeout expired after 
> 60000milliseconds while awaiting InitProducerId
> 
>
> Key: KAFKA-8803
> URL: https://issues.apache.org/jira/browse/KAFKA-8803
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Raman Gupta
>Assignee: Sophie Blee-Goldman
>Priority: Major
> Attachments: logs-20200311.txt.gz, logs-client-20200311.txt.gz, 
> logs.txt.gz, screenshot-1.png
>
>
> One streams app is consistently failing at startup with the following 
> exception:
> {code}
> 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] 
> org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout 
> exception caught when initializing transactions for task 0_36. This might 
> happen if the broker is slow to respond, if the network connection to the 
> broker was interrupted, or if similar circumstances arise. You can increase 
> producer parameter `max.block.ms` to increase this timeout.
> org.apache.kafka.common.errors.TimeoutException: Timeout expired after 
> 60000milliseconds while awaiting InitProducerId
> {code}
> These same brokers are used by many other streams without any issue, 
> including some running in the very same process as the stream that consistently 
> throws this exception.
> *UPDATE 08/16:*
> The very first instance of this error is August 13th 2019, 17:03:36.754 and 
> it happened for 4 different streams. For 3 of these streams, the error only 
> happened once, and then the stream recovered. For the 4th stream, the error 
> has continued to happen, and continues to happen now.
> I looked up the broker logs for this time, and see that at August 13th 2019, 
> 16:47:43, two of four brokers started reporting messages like this, for 
> multiple partitions:
> [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, 
> fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader 
> reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
> The UNKNOWN_LEADER_EPOCH messages continued for some time and then stopped; 
> here is a view of the count of these messages over time:
>  !screenshot-1.png! 
> However, as noted, the stream task timeout error continues to happen.
> I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 
> broker. The broker has a patch for KAFKA-8773.
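
A minimal sketch, assuming a plain Java Streams application, of raising the producer-side `max.block.ms` that the quoted error message suggests (the application id, bootstrap servers, and the 300-second value are placeholders, not recommendations):
{code:java}
import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class MaxBlockMsSketch {
    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");     // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");     // placeholder
        // Forward max.block.ms to the producers that Streams creates internally,
        // giving calls such as InitProducerId more time when the broker is slow.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.MAX_BLOCK_MS_CONFIG), 300_000);
        return props;
    }
}
{code}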



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9827) MM2 doesnt replicate data

2020-04-06 Thread Dmitry (Jira)
Dmitry created KAFKA-9827:
-

 Summary: MM2 doesnt replicate data
 Key: KAFKA-9827
 URL: https://issues.apache.org/jira/browse/KAFKA-9827
 Project: Kafka
  Issue Type: Bug
  Components: mirrormaker
Affects Versions: 2.4.1, 2.4.0
Reporter: Dmitry
 Attachments: mm2.log

I have 2 clusters with different node counts:
 # version 2.4.0, 3 nodes, broker port plaintext:9090
 # version 2.4.0, 1 node, broker port plaintext:

I want to transfer data from cluster 1 -> cluster 2 with MirrorMaker 2.

MM2 configuration:

 
{code:java}
name = mm2-backupFlow
topics = .*
groups = .*

# specify any number of cluster aliases
clusters = m1-source, m1-backup
# connection information for each cluster
m1-source.bootstrap.servers = 172.17.165.49:9090, 172.17.165.50:9090, 
172.17.165.51:9090
m1-backup.bootstrap.servers = 172.17.165.52:
# enable and configure individual replication flows
m1-source->m1-backup.enabled = true
m1-soruce->m1-backup.emit.heartbeats.enabled = false
{code}
 

When I start MM2, 3 topics are created on each cluster:
 * heartbeat
 * *.internal

And that's all. Nothing else happens.

I see the following ERRORs:
{code:java}
[2020-04-07 07:19:16,821] ERROR Plugin class loader for connector: 
'org.apache.kafka.connect.mirror.MirrorHeartbeatConnector' was not found. 
Returning: 
org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader@1e5f4170 
(org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:165)
{code}
{code:java}
[2020-04-07 07:19:18,312] ERROR Plugin class loader for connector: 
'org.apache.kafka.connect.mirror.MirrorCheckpointConnector' was not found. 
Returning: 
org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader@1e5f4170 
(org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:165)
{code}
{code:java}
[2020-04-07 07:19:16,807] ERROR Plugin class loader for connector: 
'org.apache.kafka.connect.mirror.MirrorSourceConnector' was not found. 
Returning: 
org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader@1e5f4170 
(org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:165){code}
Full log is attached.

What did I miss? Help, please.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9818) Flaky Test RecordCollectorTest.shouldNotThrowStreamsExceptionOnSubsequentCallIfASendFailsWithContinueExceptionHandler

2020-04-06 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076890#comment-17076890
 ] 

Matthias J. Sax commented on KAFKA-9818:


This might be related to https://issues.apache.org/jira/browse/KAFKA-9819 and 
we might want to do the same fix for both. Thanks to [~ableegoldman] for 
pointing it out.

> Flaky Test 
> RecordCollectorTest.shouldNotThrowStreamsExceptionOnSubsequentCallIfASendFailsWithContinueExceptionHandler
> -
>
> Key: KAFKA-9818
> URL: https://issues.apache.org/jira/browse/KAFKA-9818
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Sophie Blee-Goldman
>Assignee: Matthias J. Sax
>Priority: Major
>  Labels: flaky-test
>
> h3. Error Message
> java.lang.AssertionError
> h3. Stacktrace
> java.lang.AssertionError at org.junit.Assert.fail(Assert.java:87) at 
> org.junit.Assert.assertTrue(Assert.java:42) at 
> org.junit.Assert.assertTrue(Assert.java:53) at 
> org.apache.kafka.streams.processor.internals.RecordCollectorTest.shouldNotThrowStreamsExceptionOnSubsequentCallIfASendFailsWithContinueExceptionHandler(RecordCollectorTest.java:521)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base/java.lang.reflect.Method.invoke(Method.java:566) at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 
> at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>  at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
>  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
>  at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at 
> org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at 
> org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:413) at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:110)
>  at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
>  at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38)
>  at 
> org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:62)
>  at 
> org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
>  at jdk.internal.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base/java.lang.reflect.Method.invoke(Method.java:566) at 
> org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
>  at 
> org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
>  at 
> org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:33)
>  at 
> org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:94)
>  at com.sun.proxy.$Proxy5.processTestClass(Unknown Source) at 
> org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:118)
>  at jdk.internal.reflect.GeneratedMethodAccessor19.invoke(Unknown Source) at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base/java.lang.reflect.Method.invoke(Method.java:566) at 
> 

[jira] [Commented] (KAFKA-9819) Flaky Test StoreChangelogReaderTest#shouldNotThrowOnUnknownRevokedPartition[0]

2020-04-06 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076889#comment-17076889
 ] 

Matthias J. Sax commented on KAFKA-9819:


This might be related to https://issues.apache.org/jira/browse/KAFKA-9818 and 
we might want to do the same fix for both. Thanks to [~ableegoldman] for 
pointing it out.

> Flaky Test StoreChangelogReaderTest#shouldNotThrowOnUnknownRevokedPartition[0]
> --
>
> Key: KAFKA-9819
> URL: https://issues.apache.org/jira/browse/KAFKA-9819
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Sophie Blee-Goldman
>Priority: Major
>  Labels: flaky-test
>
> h3. Error Message
> java.lang.AssertionError: expected:<[test-reader Changelog partition 
> unknown-0 could not be found, it could be already cleaned up during the 
> handlingof task corruption and never restore again]> but was:<[[AdminClient 
> clientId=adminclient-91] Connection to node -1 (localhost/127.0.0.1:8080) 
> could not be established. Broker may not be available., test-reader Changelog 
> partition unknown-0 could not be found, it could be already cleaned up during 
> the handlingof task corruption and never restore again]>
> h3. Stacktrace
> java.lang.AssertionError: expected:<[test-reader Changelog partition 
> unknown-0 could not be found, it could be already cleaned up during the 
> handlingof task corruption and never restore again]> but was:<[[AdminClient 
> clientId=adminclient-91] Connection to node -1 (localhost/127.0.0.1:8080) 
> could not be established. Broker may not be available., test-reader Changelog 
> partition unknown-0 could not be found, it could be already cleaned up during 
> the handlingof task corruption and never restore again]>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-9819) Flaky Test StoreChangelogReaderTest#shouldNotThrowOnUnknownRevokedPartition[0]

2020-04-06 Thread Matthias J. Sax (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax reassigned KAFKA-9819:
--

Assignee: Matthias J. Sax

> Flaky Test StoreChangelogReaderTest#shouldNotThrowOnUnknownRevokedPartition[0]
> --
>
> Key: KAFKA-9819
> URL: https://issues.apache.org/jira/browse/KAFKA-9819
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Sophie Blee-Goldman
>Assignee: Matthias J. Sax
>Priority: Major
>  Labels: flaky-test
>
> h3. Error Message
> java.lang.AssertionError: expected:<[test-reader Changelog partition 
> unknown-0 could not be found, it could be already cleaned up during the 
> handlingof task corruption and never restore again]> but was:<[[AdminClient 
> clientId=adminclient-91] Connection to node -1 (localhost/127.0.0.1:8080) 
> could not be established. Broker may not be available., test-reader Changelog 
> partition unknown-0 could not be found, it could be already cleaned up during 
> the handlingof task corruption and never restore again]>
> h3. Stacktrace
> java.lang.AssertionError: expected:<[test-reader Changelog partition 
> unknown-0 could not be found, it could be already cleaned up during the 
> handlingof task corruption and never restore again]> but was:<[[AdminClient 
> clientId=adminclient-91] Connection to node -1 (localhost/127.0.0.1:8080) 
> could not be established. Broker may not be available., test-reader Changelog 
> partition unknown-0 could not be found, it could be already cleaned up during 
> the handlingof task corruption and never restore again]>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9818) Flaky Test RecordCollectorTest.shouldNotThrowStreamsExceptionOnSubsequentCallIfASendFailsWithContinueExceptionHandler

2020-04-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076888#comment-17076888
 ] 

ASF GitHub Bot commented on KAFKA-9818:
---

mjsax commented on pull request #8423: KAFKA-9818: improve error message to 
debug test
URL: https://github.com/apache/kafka/pull/8423
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Flaky Test 
> RecordCollectorTest.shouldNotThrowStreamsExceptionOnSubsequentCallIfASendFailsWithContinueExceptionHandler
> -
>
> Key: KAFKA-9818
> URL: https://issues.apache.org/jira/browse/KAFKA-9818
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Sophie Blee-Goldman
>Assignee: Matthias J. Sax
>Priority: Major
>  Labels: flaky-test
>
> h3. Error Message
> java.lang.AssertionError
> h3. Stacktrace
> java.lang.AssertionError at org.junit.Assert.fail(Assert.java:87) at 
> org.junit.Assert.assertTrue(Assert.java:42) at 
> org.junit.Assert.assertTrue(Assert.java:53) at 
> org.apache.kafka.streams.processor.internals.RecordCollectorTest.shouldNotThrowStreamsExceptionOnSubsequentCallIfASendFailsWithContinueExceptionHandler(RecordCollectorTest.java:521)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base/java.lang.reflect.Method.invoke(Method.java:566) at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 
> at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>  at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
>  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
>  at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) at 
> org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) at 
> org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:413) at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:110)
>  at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
>  at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38)
>  at 
> org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:62)
>  at 
> org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
>  at jdk.internal.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.base/java.lang.reflect.Method.invoke(Method.java:566) at 
> org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
>  at 
> org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
>  at 
> org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:33)
>  at 
> org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:94)
>  at com.sun.proxy.$Proxy5.processTestClass(Unknown Source) at 
> org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:118)
>  at 

[jira] [Commented] (KAFKA-8264) Flaky Test PlaintextConsumerTest#testLowMaxFetchSizeForRequestAndPartition

2020-04-06 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076887#comment-17076887
 ] 

Matthias J. Sax commented on KAFKA-8264:


[https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/1645/testReport/junit/kafka.api/PlaintextConsumerTest/testLowMaxFetchSizeForRequestAndPartition/]

> Flaky Test PlaintextConsumerTest#testLowMaxFetchSizeForRequestAndPartition
> --
>
> Key: KAFKA-8264
> URL: https://issues.apache.org/jira/browse/KAFKA-8264
> Project: Kafka
>  Issue Type: Bug
>  Components: core, unit tests
>Affects Versions: 2.0.1, 2.3.0
>Reporter: Matthias J. Sax
>Assignee: Stanislav Kozlovski
>Priority: Critical
>  Labels: flaky-test
> Fix For: 2.6.0
>
>
> [https://builds.apache.org/blue/organizations/jenkins/kafka-2.0-jdk8/detail/kafka-2.0-jdk8/252/tests]
> {quote}org.apache.kafka.common.errors.TopicExistsException: Topic 'topic3' 
> already exists.{quote}
> STDOUT
>  
> {quote}[2019-04-19 03:54:20,080] ERROR [ReplicaFetcher replicaId=1, 
> leaderId=0, fetcherId=0] Error for partition __consumer_offsets-0 at offset 0 
> (kafka.server.ReplicaFetcherThread:76)
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition.
> [2019-04-19 03:54:20,080] ERROR [ReplicaFetcher replicaId=2, leaderId=0, 
> fetcherId=0] Error for partition __consumer_offsets-0 at offset 0 
> (kafka.server.ReplicaFetcherThread:76)
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition.
> [2019-04-19 03:54:20,312] ERROR [ReplicaFetcher replicaId=2, leaderId=0, 
> fetcherId=0] Error for partition topic-0 at offset 0 
> (kafka.server.ReplicaFetcherThread:76)
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition.
> [2019-04-19 03:54:20,313] ERROR [ReplicaFetcher replicaId=2, leaderId=1, 
> fetcherId=0] Error for partition topic-1 at offset 0 
> (kafka.server.ReplicaFetcherThread:76)
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition.
> [2019-04-19 03:54:20,994] ERROR [ReplicaFetcher replicaId=1, leaderId=0, 
> fetcherId=0] Error for partition topic-0 at offset 0 
> (kafka.server.ReplicaFetcherThread:76)
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition.
> [2019-04-19 03:54:21,727] ERROR [ReplicaFetcher replicaId=0, leaderId=2, 
> fetcherId=0] Error for partition topicWithNewMessageFormat-1 at offset 0 
> (kafka.server.ReplicaFetcherThread:76)
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition.
> [2019-04-19 03:54:28,696] ERROR [ReplicaFetcher replicaId=0, leaderId=2, 
> fetcherId=0] Error for partition __consumer_offsets-0 at offset 0 
> (kafka.server.ReplicaFetcherThread:76)
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition.
> [2019-04-19 03:54:28,699] ERROR [ReplicaFetcher replicaId=1, leaderId=2, 
> fetcherId=0] Error for partition __consumer_offsets-0 at offset 0 
> (kafka.server.ReplicaFetcherThread:76)
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition.
> [2019-04-19 03:54:29,246] ERROR [ReplicaFetcher replicaId=0, leaderId=2, 
> fetcherId=0] Error for partition topic-1 at offset 0 
> (kafka.server.ReplicaFetcherThread:76)
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition.
> [2019-04-19 03:54:29,247] ERROR [ReplicaFetcher replicaId=0, leaderId=1, 
> fetcherId=0] Error for partition topic-0 at offset 0 
> (kafka.server.ReplicaFetcherThread:76)
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition.
> [2019-04-19 03:54:29,287] ERROR [ReplicaFetcher replicaId=1, leaderId=2, 
> fetcherId=0] Error for partition topic-1 at offset 0 
> (kafka.server.ReplicaFetcherThread:76)
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition.
> [2019-04-19 03:54:33,408] ERROR [ReplicaFetcher replicaId=0, leaderId=2, 
> fetcherId=0] Error for partition __consumer_offsets-0 at offset 0 
> (kafka.server.ReplicaFetcherThread:76)
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition.
> [2019-04-19 03:54:33,408] ERROR [ReplicaFetcher replicaId=1, leaderId=2, 
> fetcherId=0] Error for partition __consumer_offsets-0 at offset 0 
> (kafka.server.ReplicaFetcherThread:76)
> 

[jira] [Commented] (KAFKA-7965) Flaky Test ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup

2020-04-06 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076886#comment-17076886
 ] 

Matthias J. Sax commented on KAFKA-7965:


[https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/1645/testReport/junit/kafka.api/ConsumerBounceTest/testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup/]

and

[https://builds.apache.org/job/kafka-pr-jdk11-scala2.13/5664/testReport/junit/kafka.api/ConsumerBounceTest/testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup/]

> Flaky Test 
> ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup
> 
>
> Key: KAFKA-7965
> URL: https://issues.apache.org/jira/browse/KAFKA-7965
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer, unit tests
>Affects Versions: 1.1.1, 2.2.0, 2.3.0
>Reporter: Matthias J. Sax
>Priority: Critical
>  Labels: flaky-test
> Fix For: 2.3.0
>
>
> To get stable nightly builds for `2.2` release, I create tickets for all 
> observed test failures.
> [https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/21/]
> {quote}java.lang.AssertionError: Received 0, expected at least 68 at 
> org.junit.Assert.fail(Assert.java:88) at 
> org.junit.Assert.assertTrue(Assert.java:41) at 
> kafka.api.ConsumerBounceTest.receiveAndCommit(ConsumerBounceTest.scala:557) 
> at 
> kafka.api.ConsumerBounceTest.$anonfun$testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup$1(ConsumerBounceTest.scala:320)
>  at 
> kafka.api.ConsumerBounceTest.$anonfun$testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup$1$adapted(ConsumerBounceTest.scala:319)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) 
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) 
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at 
> kafka.api.ConsumerBounceTest.testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup(ConsumerBounceTest.scala:319){quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-6145) Warm up new KS instances before migrating tasks - potentially a two phase rebalance

2020-04-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-6145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076870#comment-17076870
 ] 

ASF GitHub Bot commented on KAFKA-6145:
---

ableegoldman commented on pull request #8436: KAFKA-6145: KIP-441 avoid 
unnecessary movement of standbys
URL: https://github.com/apache/kafka/pull/8436
 
 
   Currently we add warmup and standby tasks, meaning we first assign up to 
max.warmup.replica warmup tasks, and then attempt to assign num.standby copies 
of each stateful task. This can cause unnecessary transient standbys to pop up 
for the lifetime of the warmup task, which are presumably not what the user 
wanted.
   
   Note that we don’t want to simply count all warmups against the configured 
num.standbys, as this may cause the opposite problem where a standby we intend 
to keep is temporarily unassigned (which may lead to the cleanup thread 
deleting it). We should only count this as a standby if the destination client 
already had this task as a standby; otherwise, the standby already exists on 
some other client, so we should aim to give it back.
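
   A hypothetical sketch (names are illustrative, not the actual assignor code) of the counting rule described above: a warmup task counts toward a client's num.standbys only if that client already hosted the task as a standby before the rebalance.
{code:java}
import java.util.Collections;
import java.util.Map;
import java.util.Set;

final class WarmupCountingSketch {
    /**
     * True if assigning taskId as a warmup on destinationClient should count
     * toward that client's num.standbys, i.e. only when the client already
     * hosted the task as a standby before the rebalance.
     */
    static boolean countsAsStandby(String taskId,
                                   String destinationClient,
                                   Map<String, Set<String>> previousStandbysByClient) {
        return previousStandbysByClient
                .getOrDefault(destinationClient, Collections.emptySet())
                .contains(taskId);
    }
}
{code}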
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Warm up new KS instances before migrating tasks - potentially a two phase 
> rebalance
> ---
>
> Key: KAFKA-6145
> URL: https://issues.apache.org/jira/browse/KAFKA-6145
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Reporter: Antony Stubbs
>Assignee: Sophie Blee-Goldman
>Priority: Major
>  Labels: needs-kip
>
> Currently when expanding the KS cluster, the new node's partitions will be 
> unavailable during the rebalance, which for large states can take a very long 
> time, or for small state stores even more than a few ms can be a deal breaker 
> for micro service use cases.
> One workaround would be to execute the rebalance in two phases:
> 1) start running state store building on the new node
> 2) once the state store is fully populated on the new node, only then 
> rebalance the tasks - there will still be a rebalance pause, but would be 
> greatly reduced
> Relates to: KAFKA-6144 - Allow state stores to serve stale reads during 
> rebalance



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9826) Log cleaning repeatedly picks same segment with no effect when first dirty offset is past start of active segment

2020-04-06 Thread Steve Rodrigues (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Rodrigues updated KAFKA-9826:
---
Summary: Log cleaning repeatedly picks same segment with no effect when 
first dirty offset is past start of active segment  (was: Log cleaning goes in 
infinite loop when first dirty offset is past start of active segment)

> Log cleaning repeatedly picks same segment with no effect when first dirty 
> offset is past start of active segment
> -
>
> Key: KAFKA-9826
> URL: https://issues.apache.org/jira/browse/KAFKA-9826
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Affects Versions: 2.4.1
>Reporter: Steve Rodrigues
>Assignee: Steve Rodrigues
>Priority: Major
>
> Seen on a system where a given partition had a single segment and, for 
> whatever reason (deleteRecords?), the logStartOffset was greater than the 
> base offset of that segment. There was a continuous series of 
> ```
> [2020-03-03 16:56:31,374] WARN Resetting first dirty offset of  FOO-3 to log 
> start offset 55649 since the checkpointed offset 0 is invalid. 
> (kafka.log.LogCleanerManager$)
> ```
> messages (partition name changed, it wasn't really FOO). This was expected to 
> be resolved by KAFKA-6266 but clearly wasn't. 
> Further investigation revealed that a few segments were continuously being 
> cleaned, generating messages in the `log-cleaner.log` of the form:
> ```
> [2020-03-31 13:34:50,924] INFO Cleaner 1: Beginning cleaning of log FOO-3 
> (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,924] INFO Cleaner 1: Building offset map for FOO-3... 
> (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO Cleaner 1: Building offset map for log FOO-3 
> for 0 segments in offset range [55287, 54237). (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO Cleaner 1: Offset map for log FOO-3 complete. 
> (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO Cleaner 1: Cleaning log FOO-3 (cleaning prior 
> to Wed Dec 31 19:00:00 EST 1969, discarding tombstones prior to Tue Dec 10 
> 13:39:08 EST 2019)... (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO [kafka-log-cleaner-thread-1]: Log cleaner 
> thread 1 cleaned log FOO-3 (dirty section = [55287, 55287])
> 0.0 MB of log processed in 0.0 seconds (0.0 MB/sec).
> Indexed 0.0 MB in 0.0 seconds (0.0 Mb/sec, 100.0% of total time)
> Buffer utilization: 0.0%
> Cleaned 0.0 MB in 0.0 seconds (NaN Mb/sec, 0.0% of total time)
> Start size: 0.0 MB (0 messages)
> End size: 0.0 MB (0 messages) NaN% size reduction (NaN% fewer messages) 
> (kafka.log.LogCleaner)
> ```
> What seems to have happened here (data determined for a different partition) 
> is:
> There exist a number of partitions here which get relatively low traffic, 
> including our friend FOO-5. For whatever reason, LogStartOffset of this 
> partition has moved beyond the baseOffset of the active segment. (Notes in 
> other issues indicate that this is a reasonable scenario.) So there’s one 
> segment, starting at 166266, and a log start of 166301.
> grabFilthiestCompactedLog runs and reads the checkpoint file. We see that 
> this topicpartition needs to be cleaned, and call cleanableOffsets on it 
> which returns an OffsetsToClean with firstDirtyOffset == logStartOffset == 
> 166301 and firstUncleanableOffset = max(logStart, activeSegment.baseOffset) = 
> 166301, and forceCheckpoint = true.
> The checkpoint file is updated in grabFilthiestCompactedLog (this is the fix 
> for KAFKA-6266). We then create a LogToClean object based on the 
> firstDirtyOffset and firstUncleanableOffset of 166301 (past the active 
> segment’s base offset).
> The LogToClean object has cleanBytes = logSegments(-1, 
> firstDirtyOffset).map(_.size).sum → the size of this segment. It has 
> firstUncleanableOffset and cleanableBytes determined by 
> calculateCleanableBytes. calculateCleanableBytes returns:
> {{val firstUncleanableSegment = log.nonActiveLogSegmentsFrom(uncleanableOffset).headOption.getOrElse(log.activeSegment)}}
> {{val firstUncleanableOffset = firstUncleanableSegment.baseOffset}}
> {{val cleanableBytes = log.logSegments(firstDirtyOffset, math.max(firstDirtyOffset, firstUncleanableOffset)).map(_.size.toLong).sum}}
> {{(firstUncleanableOffset, cleanableBytes)}}
> firstUncleanableSegment is activeSegment. firstUncleanableOffset is the base 
> offset, 166266. cleanableBytes is looking for logSegments(166301, max(166301, 
> 166266) → which _is the active segment_
> So there are “cleanableBytes” > 0.
> We then filter out segments with totalbytes (clean + cleanable) > 0. This 
> segment has totalBytes > 0, and it has cleanablebytes, so great! It’s 
> filthiest.
> The cleaner picks it, calls cleanLog on it, which then does 

[jira] [Resolved] (KAFKA-9815) Consumer may never re-join if inconsistent metadata is received once

2020-04-06 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson resolved KAFKA-9815.

Fix Version/s: 2.4.2
   2.5.0
   Resolution: Fixed

> Consumer may never re-join if inconsistent metadata is received once
> 
>
> Key: KAFKA-9815
> URL: https://issues.apache.org/jira/browse/KAFKA-9815
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 2.5.0, 2.4.2
>
>
> KAFKA-9797 is the result of an incorrect rolling upgrade test where a new 
> listener is added to brokers and set as the inter-broker listener within the 
> same rolling upgrade. As a result, metadata is inconsistent across brokers 
> until the rolling upgrade completes since interbroker communication is broken 
> until all brokers have the new listener. The test fails due to consumer 
> timeouts and sometimes this is because the upgrade takes longer than consumer 
> timeout. But several logs show an issue with the consumer when one metadata 
> response received during upgrade is different from the consumer's cached 
> `assignmentSnapshot`, triggering rebalance.
> In 
> [https://github.com/apache/kafka/blob/7f640f13b4d486477035c3edb28466734f053beb/clients/src/main/java/org/apache/kafka/clients/consumer/internals/ConsumerCoordinator.java#L750,]
>  we return true for `rejoinNeededOrPending()` if `assignmentSnapshot` is not 
> the same as the current `metadataSnapshot`. We don't set `rejoinNeeded` in 
> the instance, but we revoke partitions and send JoinGroup request. If the 
> JoinGroup request fails and a subsequent metadata response contains the same 
> snapshot value as the previously cached `assignmentSnapshot`, we never send 
> `JoinGroup` again since snapshots match and `rejoinNeeded=false`. Partitions 
> are not assigned to the consumer after this and the test fails because 
> messages are not received.
> Even though this particular system test isn't a valid upgrade scenario, we 
> should fix the consumer, since temporary metadata differences between brokers 
> can result in this scenario.
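
An illustrative sketch (not the real ConsumerCoordinator code) of the fix idea described above: once a metadata change has triggered a rebalance, remember it, so a failed JoinGroup is retried even if a later metadata snapshot matches the cached assignment again.
{code:java}
final class RejoinSketch {
    private Object assignmentSnapshot;
    private Object metadataSnapshot;
    private boolean rejoinNeeded;

    boolean rejoinNeededOrPending() {
        if (assignmentSnapshot != null && !assignmentSnapshot.equals(metadataSnapshot)) {
            rejoinNeeded = true;   // sticky: survives a failed JoinGroup
        }
        return rejoinNeeded;
    }

    void onMetadataUpdate(Object newMetadataSnapshot) {
        metadataSnapshot = newMetadataSnapshot;
    }

    void onJoinComplete(Object newAssignmentSnapshot) {
        assignmentSnapshot = newAssignmentSnapshot;
        rejoinNeeded = false;      // cleared only after a successful JoinGroup
    }

    void onJoinFailure() {
        // intentionally keep rejoinNeeded == true so the consumer tries again
    }
}
{code}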



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9815) Consumer may never re-join if inconsistent metadata is received once

2020-04-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076764#comment-17076764
 ] 

ASF GitHub Bot commented on KAFKA-9815:
---

hachikuji commented on pull request #8420: KAFKA-9815; Ensure consumer always 
re-joins if JoinGroup fails
URL: https://github.com/apache/kafka/pull/8420
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Consumer may never re-join if inconsistent metadata is received once
> 
>
> Key: KAFKA-9815
> URL: https://issues.apache.org/jira/browse/KAFKA-9815
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
>
> KAFKA-9797 is the result of an incorrect rolling upgrade test where a new 
> listener is added to brokers and set as the inter-broker listener within the 
> same rolling upgrade. As a result, metadata is inconsistent across brokers 
> until the rolling upgrade completes since interbroker communication is broken 
> until all brokers have the new listener. The test fails due to consumer 
> timeouts and sometimes this is because the upgrade takes longer than consumer 
> timeout. But several logs show an issue with the consumer when one metadata 
> response received during upgrade is different from the consumer's cached 
> `assignmentSnapshot`, triggering rebalance.
> In 
> [https://github.com/apache/kafka/blob/7f640f13b4d486477035c3edb28466734f053beb/clients/src/main/java/org/apache/kafka/clients/consumer/internals/ConsumerCoordinator.java#L750,]
>  we return true for `rejoinNeededOrPending()` if `assignmentSnapshot` is not 
> the same as the current `metadataSnapshot`. We don't set `rejoinNeeded` in 
> the instance, but we revoke partitions and send JoinGroup request. If the 
> JoinGroup request fails and a subsequent metadata response contains the same 
> snapshot value as the previously cached `assignmentSnapshot`, we never send 
> `JoinGroup` again since snapshots match and `rejoinNeeded=false`. Partitions 
> are not assigned to the consumer after this and the test fails because 
> messages are not received.
> Even though this particular system test isn't a valid upgrade scenario, we 
> should fix the consumer, since temporary metadata differences between brokers 
> can result in this scenario.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9697) ControlPlaneNetworkProcessorAvgIdlePercent is always NaN

2020-04-06 Thread James Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076763#comment-17076763
 ] 

James Cheng commented on KAFKA-9697:


Oh, no, we don't have that setup. I had assumed that the 
ControlPlaneNetworkProcessorAvgIdlePercent metric would be present, even if 
both the control plane and the data plane shared the same listeners. But 
apparently that is not the case?

Is this purely a documentation issue then? Should we add text saying that the 
ControlPlaneNetworkProcessorAvgIdlePercent metric will be NaN if 
control.plane.listener.name is unset? (Related, I noticed that 
ControlPlaneNetworkProcessorAvgIdlePercent isn't documented on the website)

Or, should we programmatically hide the metric if control.plane.listener.name 
is not yet set?

Or, can we still calculate a value of 
ControlPlaneNetworkProcessorAvgIdlePercent when the control plane and data 
plane share a listener?
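
For context, a hedged server.properties sketch (listener names, host, and ports are placeholders) of the dedicated control-plane listener that, per KIP-291, appears to be required for this metric to report a meaningful value:
{code}
# Placeholder sketch, not a recommendation: define a separate listener and point
# control.plane.listener.name at it so controller traffic gets its own network processors.
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
advertised.listeners=PLAINTEXT://broker1:9092,CONTROLLER://broker1:9093
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT
control.plane.listener.name=CONTROLLER
{code}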

 

 

> ControlPlaneNetworkProcessorAvgIdlePercent is always NaN
> 
>
> Key: KAFKA-9697
> URL: https://issues.apache.org/jira/browse/KAFKA-9697
> Project: Kafka
>  Issue Type: Improvement
>  Components: metrics
>Affects Versions: 2.3.0
>Reporter: James Cheng
>Priority: Major
>
> I have a broker running Kafka 2.3.0. The value of 
> kafka.network:type=SocketServer,name=ControlPlaneNetworkProcessorAvgIdlePercent
>  is always "NaN".
> Is that normal, or is there a problem with the metric?
> I am running Kafka 2.3.0. I have not checked this in newer/older versions.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9826) Log cleaning goes in infinite loop when first dirty offset is past start of active segment

2020-04-06 Thread Steve Rodrigues (Jira)
Steve Rodrigues created KAFKA-9826:
--

 Summary: Log cleaning goes in infinite loop when first dirty 
offset is past start of active segment
 Key: KAFKA-9826
 URL: https://issues.apache.org/jira/browse/KAFKA-9826
 Project: Kafka
  Issue Type: Bug
  Components: log cleaner
Affects Versions: 2.4.1
Reporter: Steve Rodrigues
Assignee: Steve Rodrigues


Seen on a system where a given partition had a single segment and, for whatever 
reason (deleteRecords?), the logStartOffset was greater than the base offset of 
that segment. There was a continuous series of 

```

[2020-03-03 16:56:31,374] WARN Resetting first dirty offset of  FOO-3 to log 
start offset 55649 since the checkpointed offset 0 is invalid. 
(kafka.log.LogCleanerManager$)

```

messages (partition name changed, it wasn't really FOO). This was expected to 
be resolved by KAFKA-6266 but clearly wasn't. 

Further investigation revealed that a few segments were continuously being 
cleaned, generating messages in the `log-cleaner.log` of the form:

```

[2020-03-31 13:34:50,924] INFO Cleaner 1: Beginning cleaning of log FOO-3 
(kafka.log.LogCleaner)

[2020-03-31 13:34:50,924] INFO Cleaner 1: Building offset map for FOO-3... 
(kafka.log.LogCleaner)

[2020-03-31 13:34:50,927] INFO Cleaner 1: Building offset map for log FOO-3 for 
0 segments in offset range [55287, 54237). (kafka.log.LogCleaner)

[2020-03-31 13:34:50,927] INFO Cleaner 1: Offset map for log FOO-3 complete. 
(kafka.log.LogCleaner)

[2020-03-31 13:34:50,927] INFO Cleaner 1: Cleaning log FOO-3 (cleaning prior to 
Wed Dec 31 19:00:00 EST 1969, discarding tombstones prior to Tue Dec 10 
13:39:08 EST 2019)... (kafka.log.LogCleaner)

[2020-03-31 13:34:50,927] INFO [kafka-log-cleaner-thread-1]: Log cleaner thread 
1 cleaned log FOO-3 (dirty section = [55287, 55287])

0.0 MB of log processed in 0.0 seconds (0.0 MB/sec).

Indexed 0.0 MB in 0.0 seconds (0.0 Mb/sec, 100.0% of total time)

Buffer utilization: 0.0%

Cleaned 0.0 MB in 0.0 seconds (NaN Mb/sec, 0.0% of total time)

Start size: 0.0 MB (0 messages)

End size: 0.0 MB (0 messages) NaN% size reduction (NaN% fewer messages) 
(kafka.log.LogCleaner)

```

What seems to have happened here (data determined for a different partition) is:

There exist a number of partitions here which get relatively low traffic, 
including our friend FOO-5. For whatever reason, LogStartOffset of this 
partition has moved beyond the baseOffset of the active segment. (Notes in 
other issues indicate that this is a reasonable scenario.) So there’s one 
segment, starting at 166266, and a log start of 166301.

grabFilthiestCompactedLog runs and reads the checkpoint file. We see that this 
topicpartition needs to be cleaned, and call cleanableOffsets on it which 
returns an OffsetsToClean with firstDirtyOffset == logStartOffset == 166301 and 
firstUncleanableOffset = max(logStart, activeSegment.baseOffset) = 166301, and 
forceCheckpoint = true.

The checkpoint file is updated in grabFilthiestCompactedLog (this is the fix 
for KAFKA-6266). We then create a LogToClean object based on the 
firstDirtyOffset and firstUncleanableOffset of 166301 (past the active 
segment’s base offset).

The LogToClean object has cleanBytes = logSegments(-1, 
firstDirtyOffset).map(_.size).sum → the size of this segment. It has 
firstUncleanableOffset and cleanableBytes determined by 
calculateCleanableBytes. calculateCleanableBytes returns:
{{val firstUncleanableSegment = log.nonActiveLogSegmentsFrom(uncleanableOffset).headOption.getOrElse(log.activeSegment)}}
{{val firstUncleanableOffset = firstUncleanableSegment.baseOffset}}
{{val cleanableBytes = log.logSegments(firstDirtyOffset, math.max(firstDirtyOffset, firstUncleanableOffset)).map(_.size.toLong).sum}}
{{(firstUncleanableOffset, cleanableBytes)}}
firstUncleanableSegment is activeSegment. firstUncleanableOffset is the base 
offset, 166266. cleanableBytes is looking for logSegments(166301, max(166301, 
166266) → which _is the active segment_

So there are “cleanableBytes” > 0.

We then filter out segments with totalbytes (clean + cleanable) > 0. This 
segment has totalBytes > 0, and it has cleanablebytes, so great! It’s filthiest.

The cleaner picks it, calls cleanLog on it, which then does cleaner.clean, 
which returns nextDirtyOffset and cleaner stats. cleaner.clean calls doClean, 
which builds an offsetMap. The offsetMap looks at non-active segments, when 
building, but there aren’t any. So the endOffset of the offsetMap is lastOffset 
(default -1) + 1 → 0. We record the stats (including logging to 
log-cleaner.log). After this we call cleanerManager.doneCleaning, which writes 
the checkpoint file with the latest value… of 0.

And then the process starts all over.

It appears that there's at least one bug here, where `log.logSegments(from, 
to)` will return an empty list if from == to and both are 

[jira] [Commented] (KAFKA-9753) Add task-level active-process-ratio to Streams metrics

2020-04-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076721#comment-17076721
 ] 

ASF GitHub Bot commented on KAFKA-9753:
---

guozhangwang commented on pull request #8371: KAFKA-9753: A few more metrics to 
add
URL: https://github.com/apache/kafka/pull/8371
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add task-level active-process-ratio to Streams metrics
> --
>
> Key: KAFKA-9753
> URL: https://issues.apache.org/jira/browse/KAFKA-9753
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Reporter: Guozhang Wang
>Assignee: Guozhang Wang
>Priority: Major
> Fix For: 2.6.0
>
>
> This is described as part of KIP-444 (which is mostly done in 2.4 / 2.5).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9751) Admin `FindCoordinator` call should go to controller instead of ZK

2020-04-06 Thread Boyang Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Chen updated KAFKA-9751:
---
Parent: (was: KAFKA-9119)
Issue Type: Improvement  (was: Sub-task)

> Admin `FindCoordinator` call should go to controller instead of ZK
> --
>
> Key: KAFKA-9751
> URL: https://issues.apache.org/jira/browse/KAFKA-9751
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Boyang Chen
>Priority: Major
>
> In the current trunk, we still use ZK for topic creation in the routing of 
> FindCoordinatorRequest:
>  val (partition, topicMetadata) = CoordinatorType.forId(findCoordinatorRequest.data.keyType) match {
>    case CoordinatorType.GROUP =>
>      val partition = groupCoordinator.partitionFor(findCoordinatorRequest.data.key)
>      val metadata = getOrCreateInternalTopic(GROUP_METADATA_TOPIC_NAME, request.context.listenerName)
>      (partition, metadata)
> This should be migrated to controller handling.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9751) Admin `FindCoordinator` call should go to controller instead of ZK

2020-04-06 Thread Boyang Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boyang Chen updated KAFKA-9751:
---
Parent: KAFKA-9705
Issue Type: Sub-task  (was: Improvement)

> Admin `FindCoordinator` call should go to controller instead of ZK
> --
>
> Key: KAFKA-9751
> URL: https://issues.apache.org/jira/browse/KAFKA-9751
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Boyang Chen
>Priority: Major
>
> In the current trunk, we still use ZK for topic creation in the routing of 
> FindCoordinatorRequest:
>  val (partition, topicMetadata) = CoordinatorType.forId(findCoordinatorRequest.data.keyType) match {
>    case CoordinatorType.GROUP =>
>      val partition = groupCoordinator.partitionFor(findCoordinatorRequest.data.key)
>      val metadata = getOrCreateInternalTopic(GROUP_METADATA_TOPIC_NAME, request.context.listenerName)
>      (partition, metadata)
> This should be migrated to controller handling.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9543) Consumer offset reset after new segment rolling

2020-04-06 Thread Alexis Seigneurin (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076676#comment-17076676
 ] 

Alexis Seigneurin commented on KAFKA-9543:
--

We have been seeing the same issue here with Kafka 2.4.0.

Here is an example: at 2020-04-05 06:54:15.746, the offsets recorded on the 
consumers were between 623860252 and 623869089, and we received a record with 
offset 405637478. These offsets and times coincide with logs seen on the 
brokers (see attached screenshot).

!image-2020-04-06-17-10-32-636.png!

> Consumer offset reset after new segment rolling
> ---
>
> Key: KAFKA-9543
> URL: https://issues.apache.org/jira/browse/KAFKA-9543
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Rafał Boniecki
>Priority: Major
> Attachments: Untitled.png, image-2020-04-06-17-10-32-636.png
>
>
> After upgrading from Kafka 2.1.1 to 2.4.0, I'm experiencing unexpected consumer 
> offset resets.
> Consumer:
> {code:java}
> 2020-02-12T11:12:58.402+01:00 hostname 4a2a39a35a02 
> [2020-02-12T11:12:58,402][INFO 
> ][org.apache.kafka.clients.consumer.internals.Fetcher] [Consumer 
> clientId=logstash-1, groupId=logstash] Fetch offset 1632750575 is out of 
> range for partition stats-5, resetting offset
> {code}
> Broker:
> {code:java}
> 2020-02-12 11:12:58:400 CET INFO  
> [data-plane-kafka-request-handler-1][kafka.log.Log] [Log partition=stats-5, 
> dir=/kafka4/data] Rolled new log segment at offset 1632750565 in 2 ms.{code}
> All resets are perfectly correlated with rolling new segments at the broker: 
> the segment is rolled first, then, a couple of ms later, the reset on the consumer 
> occurs. Attached is a Grafana graph with consumer lag per partition. All sudden 
> spikes in lag are offset resets due to this bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9543) Consumer offset reset after new segment rolling

2020-04-06 Thread Alexis Seigneurin (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis Seigneurin updated KAFKA-9543:
-
Attachment: image-2020-04-06-17-10-32-636.png

> Consumer offset reset after new segment rolling
> ---
>
> Key: KAFKA-9543
> URL: https://issues.apache.org/jira/browse/KAFKA-9543
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Rafał Boniecki
>Priority: Major
> Attachments: Untitled.png, image-2020-04-06-17-10-32-636.png
>
>
> After upgrading from Kafka 2.1.1 to 2.4.0, I'm experiencing unexpected consumer 
> offset resets.
> Consumer:
> {code:java}
> 2020-02-12T11:12:58.402+01:00 hostname 4a2a39a35a02 
> [2020-02-12T11:12:58,402][INFO 
> ][org.apache.kafka.clients.consumer.internals.Fetcher] [Consumer 
> clientId=logstash-1, groupId=logstash] Fetch offset 1632750575 is out of 
> range for partition stats-5, resetting offset
> {code}
> Broker:
> {code:java}
> 2020-02-12 11:12:58:400 CET INFO  
> [data-plane-kafka-request-handler-1][kafka.log.Log] [Log partition=stats-5, 
> dir=/kafka4/data] Rolled new log segment at offset 1632750565 in 2 ms.{code}
> All resets are perfectly correlated with rolling new segments at the broker: 
> the segment is rolled first, then, a couple of ms later, the reset on the consumer 
> occurs. Attached is a Grafana graph with consumer lag per partition. All sudden 
> spikes in lag are offset resets due to this bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9438) Issue with mm2 active/active replication

2020-04-06 Thread Rens Groothuijsen (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076672#comment-17076672
 ] 

Rens Groothuijsen commented on KAFKA-9438:
--

[~romanius11] Tried to replicate this issue, though the only scenario in which 
I managed to do so was one where the MM instance accidentally looped back to 
the source cluster rather than writing to the target cluster.

> Issue with mm2 active/active replication
> 
>
> Key: KAFKA-9438
> URL: https://issues.apache.org/jira/browse/KAFKA-9438
> Project: Kafka
>  Issue Type: Bug
>  Components: mirrormaker
>Affects Versions: 2.4.0
>Reporter: Roman
>Priority: Minor
>
> Hi,
>  
> I am trying to configure active/active replication with the new Kafka 2.4.0 and MM2.
> I have 2 Kafka clusters, one in (let's say) BO and one in IN.
> Each cluster has 3 brokers.
> Topics are replicated properly, so in BO I see
> {quote}topics
> in.topics
> {quote}
>  
> in IN I see
> {quote}topics
> bo.topic
> {quote}
>  
> That is as expected according to the documentation.
>  
> But when I stop the replication process in one data center and start it up again, 
> the replication creates the topics with the prefix applied twice, i.e. bo.bo.topics 
> or in.in.topics, depending on which connector I restart.
> I have also blacklisted these topics, but they are still replicating.
>  
> bo.properties file
> {quote}name = in-bo
>  #topics = .*
>  topics.blacklist = "bo.*"
>  #groups = .*
>  connector.class = org.apache.kafka.connect.mirror.MirrorSourceConnector
>  tasks.max = 10
> source.cluster.alias = in
>  target.cluster.alias = bo
>  source.cluster.bootstrap.servers = 
> IN-kafka1:9092,IN-kafka2:9092,IN-kafka3:9092
>  target.cluster.bootstrap.servers = 
> BO-kafka1:9092,BO-kafka2:9092,BO-kafka3:9092
> use ByteArrayConverter to ensure that records are not re-encoded
>  key.converter = org.apache.kafka.connect.converters.ByteArrayConverter
>  value.converter = org.apache.kafka.connect.converters.ByteArrayConverter
>  
> {quote}
> in.properties
> {quote}name = bo-in
>  #topics = .*
>  topics.blacklist = "in.*"
>  #groups = .*
>  connector.class = org.apache.kafka.connect.mirror.MirrorSourceConnector
>  tasks.max = 10
> source.cluster.alias = bo
>  target.cluster.alias = in
>  target.cluster.bootstrap.servers = 
> IN-kafka1:9092,IN-kafka2:9092,IN-kafka3:9092
>  source.cluster.bootstrap.servers = 
> BO-kafka1:9092,BO-kafka2:9092,BO-kafka3:9092
> use ByteArrayConverter to ensure that records are not re-encoded
>  key.converter = org.apache.kafka.connect.converters.ByteArrayConverter
>  value.converter = org.apache.kafka.connect.converters.ByteArrayConverter
> {quote}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9825) Kafka protocol BNF format should have some way to display tagged fields

2020-04-06 Thread Colin McCabe (Jira)
Colin McCabe created KAFKA-9825:
---

 Summary: Kafka protocol BNF format should have some way to display 
tagged fields
 Key: KAFKA-9825
 URL: https://issues.apache.org/jira/browse/KAFKA-9825
 Project: Kafka
  Issue Type: Improvement
Reporter: Colin McCabe


The Kafka protocol BNF format should have some way to display tagged fields.  
But in clients/src/main/java/org/apache/kafka/common/protocol/Protocol.java, 
there is no special treatment for fields with a tag. Maybe something like 
FIELD_NAME<1> (where 1 = the tag number) would work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9825) Kafka protocol BNF format should have some way to display tagged fields

2020-04-06 Thread Colin McCabe (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin McCabe updated KAFKA-9825:

Labels: newbie  (was: )

> Kafka protocol BNF format should have some way to display tagged fields
> ---
>
> Key: KAFKA-9825
> URL: https://issues.apache.org/jira/browse/KAFKA-9825
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Colin McCabe
>Priority: Major
>  Labels: newbie
>
> The Kafka protocol BNF format should have some way to display tagged fields.  
> But in clients/src/main/java/org/apache/kafka/common/protocol/Protocol.java, 
> there is no special treatment for fields with a tag. Maybe something like 
> FIELD_NAME<1> (where 1 = the tag number) would work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9770) Caching State Store does not Close Underlying State Store When Exception is Thrown During Flushing

2020-04-06 Thread John Roesler (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076615#comment-17076615
 ] 

John Roesler commented on KAFKA-9770:
-

Thanks, [~mjsax] !

> Caching State Store does not Close Underlying State Store When Exception is 
> Thrown During Flushing
> --
>
> Key: KAFKA-9770
> URL: https://issues.apache.org/jira/browse/KAFKA-9770
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.0.0
>Reporter: Bruno Cadonna
>Assignee: Bruno Cadonna
>Priority: Major
> Fix For: 2.5.0
>
>
> When a caching state store is closed it calls its {{flush()}} method. If 
> {{flush()}} throws an exception the underlying state store is not closed.  
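The fix pattern implied by the description is to guarantee that the wrapped store is closed even when flush() fails. A minimal, self-contained sketch of that try/finally pattern (CachingStoreSketch and its inner store are hypothetical stand-ins, not the actual Kafka Streams classes):

{code:java}
// Close must release the underlying store even if flushing the cache throws.
public class CachingStoreSketch implements AutoCloseable {

    private final AutoCloseable inner;   // the wrapped/underlying store

    public CachingStoreSketch(AutoCloseable inner) {
        this.inner = inner;
    }

    public void flush() {
        // may throw, e.g. when forwarding dirty cache entries downstream fails
    }

    @Override
    public void close() throws Exception {
        try {
            flush();
        } finally {
            inner.close();               // runs even when flush() throws, so the inner store is released
        }
    }
}
{code}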



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9770) Caching State Store does not Close Underlying State Store When Exception is Thrown During Flushing

2020-04-06 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076614#comment-17076614
 ] 

Matthias J. Sax commented on KAFKA-9770:


The commit was cherry-picked to 2.5 and there will be a new RC. Updated the 
"fixed version" to 2.5 and resolved the ticket.

> Caching State Store does not Close Underlying State Store When Exception is 
> Thrown During Flushing
> --
>
> Key: KAFKA-9770
> URL: https://issues.apache.org/jira/browse/KAFKA-9770
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.0.0
>Reporter: Bruno Cadonna
>Assignee: Bruno Cadonna
>Priority: Major
> Fix For: 2.5.0
>
>
> When a caching state store is closed it calls its {{flush()}} method. If 
> {{flush()}} throws an exception the underlying state store is not closed.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9770) Caching State Store does not Close Underlying State Store When Exception is Thrown During Flushing

2020-04-06 Thread Matthias J. Sax (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated KAFKA-9770:
---
Fix Version/s: (was: 2.6.0)

> Caching State Store does not Close Underlying State Store When Exception is 
> Thrown During Flushing
> --
>
> Key: KAFKA-9770
> URL: https://issues.apache.org/jira/browse/KAFKA-9770
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.0.0
>Reporter: Bruno Cadonna
>Assignee: Bruno Cadonna
>Priority: Major
> Fix For: 2.5.0
>
>
> When a caching state store is closed it calls its {{flush()}} method. If 
> {{flush()}} throws an exception the underlying state store is not closed.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9798) org.apache.kafka.streams.integration.QueryableStateIntegrationTest.shouldAllowConcurrentAccesses

2020-04-06 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076606#comment-17076606
 ] 

Matthias J. Sax commented on KAFKA-9798:


[https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/1624/testReport/junit/org.apache.kafka.streams.integration/QueryableStateIntegrationTest/shouldAllowConcurrentAccesses/]

Different error message (seems the hotfix did not do the trick? The PR should contain the hotfix \cc [~guozhang]):
{quote}java.nio.file.DirectoryNotEmptyException: 
/tmp/state-queryable-state-18623682413158663595/queryable-state-1/1_0 at 
sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:242) 
at 
sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
 at java.nio.file.Files.delete(Files.java:1126) at 
org.apache.kafka.common.utils.Utils$2.postVisitDirectory(Utils.java:802) at 
org.apache.kafka.common.utils.Utils$2.postVisitDirectory(Utils.java:772) at 
java.nio.file.Files.walkFileTree(Files.java:2688) at 
java.nio.file.Files.walkFileTree(Files.java:2742) at 
org.apache.kafka.common.utils.Utils.delete(Utils.java:772) at 
org.apache.kafka.common.utils.Utils.delete(Utils.java:758) at 
org.apache.kafka.streams.integration.utils.IntegrationTestUtils.purgeLocalStreamsState(IntegrationTestUtils.java:125)
 at 
org.apache.kafka.streams.integration.QueryableStateIntegrationTest.shutdown(QueryableStateIntegrationTest.java:225){quote}
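For context on the failure mode: DirectoryNotEmptyException in this cleanup path generally means new files appeared under the state directory while it was being deleted, e.g. from a stream thread that had not fully shut down yet. Below is a generic, hedged sketch of how such delete-vs-writer races are often worked around in test utilities; it is not the hotfix referenced above and not what purgeLocalStreamsState actually does:

{code:java}
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

public class RetryingStateDirCleanup {
    // Retry a recursive delete a few times; a concurrent writer can make an attempt fail
    // with DirectoryNotEmptyException even though every file visible at walk time was removed.
    static void deleteWithRetries(Path dir, int attempts) throws Exception {
        for (int i = 1; ; i++) {
            try {
                if (Files.exists(dir)) {
                    try (Stream<Path> walk = Files.walk(dir)) {
                        walk.sorted(Comparator.reverseOrder())      // delete children before parents
                            .forEach(p -> p.toFile().delete());
                    }
                }
                if (!Files.exists(dir)) {
                    return;                                          // fully gone, done
                }
                throw new IllegalStateException("directory still present: " + dir);
            } catch (Exception e) {
                if (i >= attempts) {
                    throw e;
                }
                Thread.sleep(100L * i);                              // back off; the writer may have finished
            }
        }
    }

    public static void main(String[] args) throws Exception {
        deleteWithRetries(Paths.get("/tmp/state-dir-placeholder"), 3); // placeholder path
    }
}
{code}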

> org.apache.kafka.streams.integration.QueryableStateIntegrationTest.shouldAllowConcurrentAccesses
> 
>
> Key: KAFKA-9798
> URL: https://issues.apache.org/jira/browse/KAFKA-9798
> Project: Kafka
>  Issue Type: Test
>  Components: streams, unit tests
>Reporter: Bill Bejeck
>Priority: Major
>  Labels: flaky-test, test
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9697) ControlPlaneNetworkProcessorAvgIdlePercent is always NaN

2020-04-06 Thread Chia-Ping Tsai (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076538#comment-17076538
 ] 

Chia-Ping Tsai commented on KAFKA-9697:
---

Did you configure control.plane.listener.name? If not, the value is always NaN.

> ControlPlaneNetworkProcessorAvgIdlePercent is always NaN
> 
>
> Key: KAFKA-9697
> URL: https://issues.apache.org/jira/browse/KAFKA-9697
> Project: Kafka
>  Issue Type: Improvement
>  Components: metrics
>Affects Versions: 2.3.0
>Reporter: James Cheng
>Priority: Major
>
> I have a broker running Kafka 2.3.0. The value of 
> kafka.network:type=SocketServer,name=ControlPlaneNetworkProcessorAvgIdlePercent
>  is always "NaN".
> Is that normal, or is there a problem with the metric?
> I am running Kafka 2.3.0. I have not checked this in newer/older versions.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9824) Consumer loses partition offset and resets post 2.4.1 version upgrade

2020-04-06 Thread Jason Gustafson (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076516#comment-17076516
 ] 

Jason Gustafson commented on KAFKA-9824:


[~nitay.k] Thanks for the report. From the logs, we saw that a segment was 
rolled at offset 1222791065, which is just after the out of range error at 
offset 1222791071 (if we trust timestamps). It would be useful to know if there 
was a leader change around this time. The messages about the coordinator 
unloading suggest that there might have been. Could you search for these two 
messages in the logs for partition trackingSolutionAttribution-48?

{code}
  info(s"$topicPartition starts at Leader Epoch 
${partitionStateInfo.basePartitionState.leaderEpoch} from " +
s"offset $leaderEpochStartOffset. Previous Leader Epoch was: 
$leaderEpoch")
{code}

and 

{code}
info(s"Truncating to offset $targetOffset")
{code}

> Consumer loses partition offset and resets post 2.4.1 version upgrade
> -
>
> Key: KAFKA-9824
> URL: https://issues.apache.org/jira/browse/KAFKA-9824
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.4.1
>Reporter: Nitay Kufert
>Priority: Major
> Attachments: image-2020-04-06-13-14-47-014.png
>
>
> Hello,
>  around 2 weeks ago we upgraded our Kafka clients & brokers to 2.4.1 (from 
> 2.3.1), 
>  and we started noticing a troubling behavior that we didn't see before:
>   
>  Without apparent reason, a specific partition on a specific consumer loses 
> its offset and start re-consuming the entire partition from the beginning 
> (according to the retention).
>   
>  Messages appearing on the consumer (client):
> {quote}Apr 5, 2020 @ 14:54:47.327 INFO sonic-fire-attribution [Consumer 
> clientId=consumer-fireAttributionConsumerGroup4-2, 
> groupId=fireAttributionConsumerGroup4] Resetting offset for partition 
> trackingSolutionAttribution-48 to offset 1216430527.
> {quote}
> {quote}Apr 5, 2020 @ 14:54:46.797 INFO sonic-fire-attribution [Consumer 
> clientId=consumer-fireAttributionConsumerGroup4-2, 
> groupId=fireAttributionConsumerGroup4] Fetch offset 1222791071 is out of 
> range for partition trackingSolutionAttribution-48
> {quote}
> Those are the logs from the brokers at the same time (searched for 
> "trackingSolutionAttribution-48" OR "fireAttributionConsumerGroup4")
> {quote}Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 
> 1222791065
>   
>  Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 
> 1222791065
>   
>  Apr 5, 2020 @ 14:54:46.801 INFO Rolled new log segment at offset 1222791065 
> in 0 ms.
>   
>  Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for 
> the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive 
> groups
>   
>  Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for 
> the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive 
> groups
> {quote}
> Another way to see the same thing, from our monitoring (DD) on the partition 
> offset:
> !image-2020-04-06-13-14-47-014.png|width=530,height=152!
> The recovery you are seeing is after I run partition offset reset manually 
> (using kafka-consumer-groups.sh --bootstrap-server localhost:9092 --topic 
> trackingSolutionAttribution:57 --group fireAttributionConsumerGroup4 
> --reset-offsets --to-datetime 'SOME DATE')
>   
>  Any idea what can be causing this? we have it happen to us at least 5 times 
> since the upgrade, and before that, I don't remember it ever happening to us.
>   
>  Topic config is set to default, except the retention, which is manually set 
> to 4320.
>  The topic has 60 partitions & a replication factor of 2. 
>   
>  Consumer config:
> {code:java}
> ConsumerConfig values:
>   allow.auto.create.topics = true
>   auto.commit.interval.ms = 5000
>   auto.offset.reset = earliest
>   bootstrap.servers = [..]
>   check.crcs = true
>   client.dns.lookup = default
>   client.id =
>   client.rack =
>   connections.max.idle.ms = 54
>   default.api.timeout.ms = 6
>   enable.auto.commit = true
>   exclude.internal.topics = true
>   fetch.max.bytes = 52428800
>   fetch.max.wait.ms = 500
>   fetch.min.bytes = 1
>   group.id = fireAttributionConsumerGroup4
>   group.instance.id = null
>   heartbeat.interval.ms = 1
>   interceptor.classes = []
>   internal.leave.group.on.close = true
>   isolation.level = read_uncommitted
>   key.deserializer = class 
> org.apache.kafka.common.serialization.ByteArrayDeserializer
>   max.partition.fetch.bytes = 1048576
>   max.poll.interval.ms = 30
>   max.poll.records = 500
>   metadata.max.age.ms = 30
>   

[jira] [Commented] (KAFKA-9824) Consumer loses partition offset and resets post 2.4.1 version upgrade

2020-04-06 Thread Boyang Chen (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076491#comment-17076491
 ] 

Boyang Chen commented on KAFKA-9824:


Thanks for reporting. Does the lost partition happen after each rebalance?

> Consumer loses partition offset and resets post 2.4.1 version upgrade
> -
>
> Key: KAFKA-9824
> URL: https://issues.apache.org/jira/browse/KAFKA-9824
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.4.1
>Reporter: Nitay Kufert
>Priority: Major
> Attachments: image-2020-04-06-13-14-47-014.png
>
>
> Hello,
>  around 2 weeks ago we upgraded our Kafka clients & brokers to 2.4.1 (from 
> 2.3.1), 
>  and we started noticing a troubling behavior that we didn't see before:
>   
>  Without apparent reason, a specific partition on a specific consumer loses 
> its offset and start re-consuming the entire partition from the beginning 
> (according to the retention).
>   
>  Messages appearing on the consumer (client):
> {quote}Apr 5, 2020 @ 14:54:47.327 INFO sonic-fire-attribution [Consumer 
> clientId=consumer-fireAttributionConsumerGroup4-2, 
> groupId=fireAttributionConsumerGroup4] Resetting offset for partition 
> trackingSolutionAttribution-48 to offset 1216430527.
> {quote}
> {quote}Apr 5, 2020 @ 14:54:46.797 INFO sonic-fire-attribution [Consumer 
> clientId=consumer-fireAttributionConsumerGroup4-2, 
> groupId=fireAttributionConsumerGroup4] Fetch offset 1222791071 is out of 
> range for partition trackingSolutionAttribution-48
> {quote}
> Those are the logs from the brokers at the same time (searched for 
> "trackingSolutionAttribution-48" OR "fireAttributionConsumerGroup4")
> {quote}Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 
> 1222791065
>   
>  Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 
> 1222791065
>   
>  Apr 5, 2020 @ 14:54:46.801 INFO Rolled new log segment at offset 1222791065 
> in 0 ms.
>   
>  Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for 
> the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive 
> groups
>   
>  Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for 
> the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive 
> groups
> {quote}
> Another way to see the same thing, from our monitoring (DD) on the partition 
> offset:
> !image-2020-04-06-13-14-47-014.png|width=530,height=152!
> The recovery you are seeing is after I run partition offset reset manually 
> (using kafka-consumer-groups.sh --bootstrap-server localhost:9092 --topic 
> trackingSolutionAttribution:57 --group fireAttributionConsumerGroup4 
> --reset-offsets --to-datetime 'SOME DATE')
>   
>  Any idea what can be causing this? we have it happen to us at least 5 times 
> since the upgrade, and before that, I don't remember it ever happening to us.
>   
>  Topic config is set to default, except the retention, which is manually set 
> to 4320.
>  The topic has 60 partitions & a replication factor of 2. 
>   
>  Consumer config:
> {code:java}
> ConsumerConfig values:
>   allow.auto.create.topics = true
>   auto.commit.interval.ms = 5000
>   auto.offset.reset = earliest
>   bootstrap.servers = [..]
>   check.crcs = true
>   client.dns.lookup = default
>   client.id =
>   client.rack =
>   connections.max.idle.ms = 54
>   default.api.timeout.ms = 6
>   enable.auto.commit = true
>   exclude.internal.topics = true
>   fetch.max.bytes = 52428800
>   fetch.max.wait.ms = 500
>   fetch.min.bytes = 1
>   group.id = fireAttributionConsumerGroup4
>   group.instance.id = null
>   heartbeat.interval.ms = 1
>   interceptor.classes = []
>   internal.leave.group.on.close = true
>   isolation.level = read_uncommitted
>   key.deserializer = class 
> org.apache.kafka.common.serialization.ByteArrayDeserializer
>   max.partition.fetch.bytes = 1048576
>   max.poll.interval.ms = 30
>   max.poll.records = 500
>   metadata.max.age.ms = 30
>   metric.reporters = []
>   metrics.num.samples = 2
>   metrics.recording.level = INFO
>   metrics.sample.window.ms = 3
>   partition.assignment.strategy = [class 
> org.apache.kafka.clients.consumer.RangeAssignor]
>   receive.buffer.bytes = 65536
>   reconnect.backoff.max.ms = 1000
>   reconnect.backoff.ms = 50
>   request.timeout.ms = 3
>   retry.backoff.ms = 100
>   sasl.client.callback.handler.class = null
>   sasl.jaas.config = null
>   sasl.kerberos.kinit.cmd = /usr/bin/kinit
>   sasl.kerberos.min.time.before.relogin = 6
>   sasl.kerberos.service.name = null
>   sasl.kerberos.ticket.renew.jitter = 

[jira] [Commented] (KAFKA-3824) Docs indicate auto.commit breaks at least once delivery but that is incorrect

2020-04-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076359#comment-17076359
 ] 

ASF GitHub Bot commented on KAFKA-3824:
---

amanullah92 commented on pull request #8194: KAFKA-3824 | MINOR | Trying to 
make autocommit behavior clear in the config s…
URL: https://github.com/apache/kafka/pull/8194
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Docs indicate auto.commit breaks at least once delivery but that is incorrect
> -
>
> Key: KAFKA-3824
> URL: https://issues.apache.org/jira/browse/KAFKA-3824
> Project: Kafka
>  Issue Type: Improvement
>  Components: consumer
>Affects Versions: 0.10.0.0
>Reporter: Jay Kreps
>Assignee: Jason Gustafson
>Priority: Major
>  Labels: newbie
> Fix For: 0.10.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The javadocs for the new consumer indicate that auto commit breaks at least 
> once delivery. This is no longer correct as of 0.10. 
> http://kafka.apache.org/0100/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9824) Consumer loses partition offset and resets post 2.4.1 version upgrade

2020-04-06 Thread Nitay Kufert (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nitay Kufert updated KAFKA-9824:

Description: 
Hello,
 around 2 weeks ago we upgraded our Kafka clients & brokers to 2.4.1 (from 
2.3.1), 
 and we started noticing a troubling behavior that we didn't see before:
  
 Without apparent reason, a specific partition on a specific consumer loses its 
offset and starts re-consuming the entire partition from the beginning 
(according to the retention).
  
 Messages appearing on the consumer (client):
{quote}Apr 5, 2020 @ 14:54:47.327 INFO sonic-fire-attribution [Consumer 
clientId=consumer-fireAttributionConsumerGroup4-2, 
groupId=fireAttributionConsumerGroup4] Resetting offset for partition 
trackingSolutionAttribution-48 to offset 1216430527.
{quote}
{quote}Apr 5, 2020 @ 14:54:46.797 INFO sonic-fire-attribution [Consumer 
clientId=consumer-fireAttributionConsumerGroup4-2, 
groupId=fireAttributionConsumerGroup4] Fetch offset 1222791071 is out of range 
for partition trackingSolutionAttribution-48
{quote}
Those are the logs from the brokers at the same time (searched for 
"trackingSolutionAttribution-48" OR "fireAttributionConsumerGroup4")
{quote}Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 
1222791065
  
 Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 1222791065
  
 Apr 5, 2020 @ 14:54:46.801 INFO Rolled new log segment at offset 1222791065 in 
0 ms.
  
 Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for 
the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive 
groups
  
 Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for 
the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive 
groups
{quote}
Another way to see the same thing, from our monitoring (DD) on the partition 
offset:

!image-2020-04-06-13-14-47-014.png|width=530,height=152!

The recovery you are seeing is after I run partition offset reset manually 
(using kafka-consumer-groups.sh --bootstrap-server localhost:9092 --topic 
trackingSolutionAttribution:57 --group fireAttributionConsumerGroup4 
--reset-offsets --to-datetime 'SOME DATE')
  
 Any idea what could be causing this? It has happened to us at least 5 times 
since the upgrade, and before that I don't remember it ever happening to us.
  
 Topic config is set to default, except the retention, which is manually set to 
4320.
 The topic has 60 partitions & a replication factor of 2. 
  
 Consumer config:
{code:java}
ConsumerConfig values:
allow.auto.create.topics = true
auto.commit.interval.ms = 5000
auto.offset.reset = earliest
bootstrap.servers = [..]
check.crcs = true
client.dns.lookup = default
client.id =
client.rack =
connections.max.idle.ms = 54
default.api.timeout.ms = 6
enable.auto.commit = true
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = fireAttributionConsumerGroup4
group.instance.id = null
heartbeat.interval.ms = 1
interceptor.classes = []
internal.leave.group.on.close = true
isolation.level = read_uncommitted
key.deserializer = class 
org.apache.kafka.common.serialization.ByteArrayDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 30
max.poll.records = 500
metadata.max.age.ms = 30
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 3
partition.assignment.strategy = [class 
org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 3
retry.backoff.ms = 100
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 6
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
security.protocol = PLAINTEXT
security.providers = null
send.buffer.bytes = 131072
session.timeout.ms = 3
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = https

[jira] [Updated] (KAFKA-9824) Consumer loses partition offset and resets post 2.4.1 version upgrade

2020-04-06 Thread Nitay Kufert (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nitay Kufert updated KAFKA-9824:

Attachment: image-2020-04-06-13-14-47-014.png

> Consumer loses partition offset and resets post 2.4.1 version upgrade
> -
>
> Key: KAFKA-9824
> URL: https://issues.apache.org/jira/browse/KAFKA-9824
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.4.1
>Reporter: Nitay Kufert
>Priority: Major
> Attachments: image-2020-04-06-13-14-47-014.png
>
>
> Hello,
> around 2 weeks ago we upgraded our Kafka clients & brokers to 2.4.1 (from 
> 2.3.1), 
> and we started noticing a troubling behavior that we didn't see before:
>  
> Without apparent reason, a specific partition on a specific consumer loses 
> its offset and start re-consuming the entire partition from the beginning 
> (according to the retention).
>  
> Messages appearing on the consumer (client):
> {quote}Apr 5, 2020 @ 14:54:47.327 INFO sonic-fire-attribution [Consumer 
> clientId=consumer-fireAttributionConsumerGroup4-2, 
> groupId=fireAttributionConsumerGroup4] Resetting offset for partition 
> trackingSolutionAttribution-48 to offset 1216430527.{quote}
> {quote}Apr 5, 2020 @ 14:54:46.797 INFO sonic-fire-attribution [Consumer 
> clientId=consumer-fireAttributionConsumerGroup4-2, 
> groupId=fireAttributionConsumerGroup4] Fetch offset 1222791071 is out of 
> range for partition trackingSolutionAttribution-48{quote}
> Those are the logs from the brokers at the same time (searched for 
> "trackingSolutionAttribution-48" OR "fireAttributionConsumerGroup4")
> {quote}Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 
> 1222791065
>  
> Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 1222791065
>  
> Apr 5, 2020 @ 14:54:46.801 INFO Rolled new log segment at offset 1222791065 
> in 0 ms.
>  
> Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for 
> the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive 
> groups
>  
> Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for 
> the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive 
> groups{quote}
> Another way to see the same thing, from our monitoring (DD) on the partition 
> offset:
> !image-2020-04-06-13-14-47-014.png|width=542,height=158!
> The recovery you are seeing is after I run partition offset reset manually 
> (using kafka-consumer-groups.sh --bootstrap-server localhost:9092 --topic 
> trackingSolutionAttribution:57 --group fireAttributionConsumerGroup4 
> --reset-offsets --to-datetime 'SOME DATE')
>  
> Any idea what can be causing this? we have it happen to us at least 5 times 
> since the upgrade, and before that, I don't remember it ever happening to us.
>  
> Topic config is set to default, except the retention, which is manually set 
> to 4320.
> The topic has 60 partitions & a replication factor of 2. 
>  
> Consumer config:
> {code:java}
> ConsumerConfig values:
>   allow.auto.create.topics = true
>   auto.commit.interval.ms = 5000
>   auto.offset.reset = earliest
>   bootstrap.servers = [..]
>   check.crcs = true
>   client.dns.lookup = default
>   client.id =
>   client.rack =
>   connections.max.idle.ms = 54
>   default.api.timeout.ms = 6
>   enable.auto.commit = true
>   exclude.internal.topics = true
>   fetch.max.bytes = 52428800
>   fetch.max.wait.ms = 500
>   fetch.min.bytes = 1
>   group.id = fireAttributionConsumerGroup4
>   group.instance.id = null
>   heartbeat.interval.ms = 1
>   interceptor.classes = []
>   internal.leave.group.on.close = true
>   isolation.level = read_uncommitted
>   key.deserializer = class 
> org.apache.kafka.common.serialization.ByteArrayDeserializer
>   max.partition.fetch.bytes = 1048576
>   max.poll.interval.ms = 30
>   max.poll.records = 500
>   metadata.max.age.ms = 30
>   metric.reporters = []
>   metrics.num.samples = 2
>   metrics.recording.level = INFO
>   metrics.sample.window.ms = 3
>   partition.assignment.strategy = [class 
> org.apache.kafka.clients.consumer.RangeAssignor]
>   receive.buffer.bytes = 65536
>   reconnect.backoff.max.ms = 1000
>   reconnect.backoff.ms = 50
>   request.timeout.ms = 3
>   retry.backoff.ms = 100
>   sasl.client.callback.handler.class = null
>   sasl.jaas.config = null
>   sasl.kerberos.kinit.cmd = /usr/bin/kinit
>   

[jira] [Updated] (KAFKA-9824) Consumer loses partition offset and resets post 2.4.1 version upgrade

2020-04-06 Thread Nitay Kufert (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nitay Kufert updated KAFKA-9824:

Description: 
Hello,
 around 2 weeks ago we upgraded our Kafka clients & brokers to 2.4.1 (from 
2.3.1), 
 and we started noticing a troubling behavior that we didn't see before:
  
 Without apparent reason, a specific partition on a specific consumer loses its 
offset and starts re-consuming the entire partition from the beginning 
(according to the retention).
  
 Messages appearing on the consumer (client):
{quote}Apr 5, 2020 @ 14:54:47.327 INFO sonic-fire-attribution [Consumer 
clientId=consumer-fireAttributionConsumerGroup4-2, 
groupId=fireAttributionConsumerGroup4] Resetting offset for partition 
trackingSolutionAttribution-48 to offset 1216430527.
{quote}
{quote}Apr 5, 2020 @ 14:54:46.797 INFO sonic-fire-attribution [Consumer 
clientId=consumer-fireAttributionConsumerGroup4-2, 
groupId=fireAttributionConsumerGroup4] Fetch offset 1222791071 is out of range 
for partition trackingSolutionAttribution-48
{quote}
Those are the logs from the brokers at the same time (searched for 
"trackingSolutionAttribution-48" OR "fireAttributionConsumerGroup4")
{quote}Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 
1222791065
  
 Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 1222791065
  
 Apr 5, 2020 @ 14:54:46.801 INFO Rolled new log segment at offset 1222791065 in 
0 ms.
  
 Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for 
the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive 
groups
  
 Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for 
the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive 
groups
{quote}
Another way to see the same thing, from our monitoring (DD) on the partition 
offset:


!image-2020-04-06-13-14-47-014.png!

 The recovery you are seeing is after I run partition offset reset manually 
(using kafka-consumer-groups.sh --bootstrap-server localhost:9092 --topic 
trackingSolutionAttribution:57 --group fireAttributionConsumerGroup4 
--reset-offsets --to-datetime 'SOME DATE')
  
 Any idea what could be causing this? It has happened to us at least 5 times 
since the upgrade, and before that I don't remember it ever happening to us.
  
 Topic config is set to default, except the retention, which is manually set to 
4320.
 The topic has 60 partitions & a replication factor of 2. 
  
 Consumer config:
{code:java}
ConsumerConfig values:
allow.auto.create.topics = true
auto.commit.interval.ms = 5000
auto.offset.reset = earliest
bootstrap.servers = [..]
check.crcs = true
client.dns.lookup = default
client.id =
client.rack =
connections.max.idle.ms = 54
default.api.timeout.ms = 6
enable.auto.commit = true
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = fireAttributionConsumerGroup4
group.instance.id = null
heartbeat.interval.ms = 1
interceptor.classes = []
internal.leave.group.on.close = true
isolation.level = read_uncommitted
key.deserializer = class 
org.apache.kafka.common.serialization.ByteArrayDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 30
max.poll.records = 500
metadata.max.age.ms = 30
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 3
partition.assignment.strategy = [class 
org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 3
retry.backoff.ms = 100
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 6
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
security.protocol = PLAINTEXT
security.providers = null
send.buffer.bytes = 131072
session.timeout.ms = 3
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = https
ssl.key.password = null

[jira] [Created] (KAFKA-9824) Consumer loses partition offset and resets post 2.4.1 version upgrade

2020-04-06 Thread Nitay Kufert (Jira)
Nitay Kufert created KAFKA-9824:
---

 Summary: Consumer loses partition offset and resets post 2.4.1 
version upgrade
 Key: KAFKA-9824
 URL: https://issues.apache.org/jira/browse/KAFKA-9824
 Project: Kafka
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Nitay Kufert


Hello,
around 2 weeks ago we upgraded our Kafka clients & brokers to 2.4.1 (from 
2.3.1), 
and we started noticing a troubling behavior that we didn't see before:
 
Without apparent reason, a specific partition on a specific consumer loses its 
offset and starts re-consuming the entire partition from the beginning 
(according to the retention).
 
Messages appearing on the consumer (client):
{quote}Apr 5, 2020 @ 14:54:47.327 INFO sonic-fire-attribution [Consumer 
clientId=consumer-fireAttributionConsumerGroup4-2, 
groupId=fireAttributionConsumerGroup4] Resetting offset for partition 
trackingSolutionAttribution-48 to offset 1216430527.{quote}
{quote}Apr 5, 2020 @ 14:54:46.797 INFO sonic-fire-attribution [Consumer 
clientId=consumer-fireAttributionConsumerGroup4-2, 
groupId=fireAttributionConsumerGroup4] Fetch offset 1222791071 is out of range 
for partition trackingSolutionAttribution-48{quote}
Those are the logs from the brokers at the same time (searched for 
"trackingSolutionAttribution-48" OR "fireAttributionConsumerGroup4")

{quote}Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 
1222791065
 
Apr 5, 2020 @ 14:54:46.801 INFO Writing producer snapshot at offset 1222791065
 
Apr 5, 2020 @ 14:54:46.801 INFO Rolled new log segment at offset 1222791065 in 
0 ms.
 
Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for 
the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive 
groups
 
Apr 5, 2020 @ 14:54:04.400 INFO BrokerId 1033 is no longer a coordinator for 
the group fireAttributionConsumerGroup4. Proceeding cleanup for other alive 
groups{quote}
Another way to see the same thing, from our monitoring (DD) on the partition 
offset:

!image-2020-04-06-13-14-47-014.png|width=542,height=158!
The recovery you are seeing is after I run partition offset reset manually 
(using kafka-consumer-groups.sh --bootstrap-server localhost:9092 --topic 
trackingSolutionAttribution:57 --group fireAttributionConsumerGroup4 
--reset-offsets --to-datetime 'SOME DATE')
 
Any idea what could be causing this? It has happened to us at least 5 times 
since the upgrade, and before that I don't remember it ever happening to us.
 
Topic config is set to default, except the retention, which is manually set to 
4320.
The topic has 60 partitions & a replication factor of 2. 
 
Consumer config:
{code:java}
ConsumerConfig values:
allow.auto.create.topics = true
auto.commit.interval.ms = 5000
auto.offset.reset = earliest
bootstrap.servers = [..]
check.crcs = true
client.dns.lookup = default
client.id =
client.rack =
connections.max.idle.ms = 54
default.api.timeout.ms = 6
enable.auto.commit = true
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = fireAttributionConsumerGroup4
group.instance.id = null
heartbeat.interval.ms = 1
interceptor.classes = []
internal.leave.group.on.close = true
isolation.level = read_uncommitted
key.deserializer = class 
org.apache.kafka.common.serialization.ByteArrayDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 30
max.poll.records = 500
metadata.max.age.ms = 30
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 3
partition.assignment.strategy = [class 
org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 3
retry.backoff.ms = 100
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 6
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8

[jira] [Commented] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions

2020-04-06 Thread Evan Williams (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076167#comment-17076167
 ] 

Evan Williams commented on KAFKA-4084:
--

[~sql_consulting]

Many thanks for the detailed info. Sounds like some great work there, and very 
handy features.

On another somewhat related note (and I can open a bug ticket for this if need 
be): I've noticed that on a cluster (5.4) with 6000 topics, manual leader 
election times out. Is there a way to increase the timeout? If not, then our 
only option is auto.leader.rebalance.enable=true. I guess it's important that 
PLE works for all of this functionality to work properly.

 

*kafka-leader-election --bootstrap-server $(grep advertised.listeners= 
/etc/kafka/server.properties |cut -d: -f4 |cut -d/ -f3):9092 
--all-topic-partitions --election-type preferred*

Timeout waiting for election results
Exception in thread "main" kafka.common.AdminCommandFailedException: Timeout 
waiting for election results
 at 
kafka.admin.LeaderElectionCommand$.electLeaders(LeaderElectionCommand.scala:133)
 at kafka.admin.LeaderElectionCommand$.run(LeaderElectionCommand.scala:88)
 at kafka.admin.LeaderElectionCommand$.main(LeaderElectionCommand.scala:41)
 at kafka.admin.LeaderElectionCommand.main(LeaderElectionCommand.scala)
Caused by: org.apache.kafka.common.errors.TimeoutException: Aborted due to 
timeout.
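
As a possible workaround while the CLI timeout is investigated, the same preferred election can be triggered through the Admin API with an explicitly raised client-side timeout. A hedged sketch assuming the 2.4+ Admin client; the bootstrap address and timeout values are placeholders, and whether this helps depends on whether the timeout is purely client-side or the election itself is slow on the controller:

{code:java}
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ElectLeadersOptions;
import org.apache.kafka.common.ElectionType;

import java.util.Properties;

public class PreferredElection {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        props.put(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, "300000");      // raise the client-side timeout

        try (Admin admin = Admin.create(props)) {
            // a null partition set requests election for all partitions (what --all-topic-partitions does)
            ElectLeadersOptions options = new ElectLeadersOptions().timeoutMs(300_000);
            int attempted = admin.electLeaders(ElectionType.PREFERRED, null, options)
                                 .partitions()
                                 .get()
                                 .size();
            System.out.println("Election attempted for " + attempted + " partitions");
        }
    }
}
{code}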

> automated leader rebalance causes replication downtime for clusters with too 
> many partitions
> 
>
> Key: KAFKA-4084
> URL: https://issues.apache.org/jira/browse/KAFKA-4084
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>Reporter: Tom Crayford
>Priority: Major
>  Labels: reliability
> Fix For: 1.1.0
>
>
> If you enable {{auto.leader.rebalance.enable}} (which is on by default), and 
> you have a cluster with many partitions, there is a severe amount of 
> replication downtime following a restart. This causes 
> `UnderReplicatedPartitions` to fire, and replication is paused.
> This is because the current automated leader rebalance mechanism changes 
> leaders for *all* imbalanced partitions at once, instead of doing it 
> gradually. This effectively stops all replica fetchers in the cluster 
> (assuming there are enough imbalanced partitions), and restarts them. This 
> can take minutes on busy clusters, during which no replication is happening 
> and user data is at risk. Clients with {{acks=-1}} also see issues at this 
> time, because replication is effectively stalled.
> To quote Todd Palino from the mailing list:
> bq. There is an admin CLI command to trigger the preferred replica election 
> manually. There is also a broker configuration “auto.leader.rebalance.enable” 
> which you can set to have the broker automatically perform the PLE when 
> needed. DO NOT USE THIS OPTION. There are serious performance issues when 
> doing so, especially on larger clusters. It needs some development work that 
> has not been fully identified yet.
> This setting is extremely useful for smaller clusters, but with high 
> partition counts causes the huge issues stated above.
> One potential fix could be adding a new configuration for the number of 
> partitions to do automated leader rebalancing for at once, and *stop* once 
> that number of leader rebalances are in flight, until they're done. There may 
> be better mechanisms, and I'd love to hear if anybody has any ideas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions

2020-04-06 Thread GEORGE LI (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076124#comment-17076124
 ] 

GEORGE LI commented on KAFKA-4084:
--

[~blodsbror] 

If this turns out to be positive in testing, I can restart the discussion on the 
dev mailing list for KIP-491; at least it works/helps with 
auto.leader.rebalance.enable=true.

There are other use cases listed in KIP-491, e.g. when the controller is busy 
with metadata requests: we can set this dynamic config for the controller and 
run PLE, and the controller will give up all its leadership and act just as a 
follower, with CPU usage down 10-15%, making it light-weight while doing its 
controller work, with no need to bounce the controller. I know some companies 
are working on the feature of separating the controller onto another set of 
machines.


Our primary use case of this `leader.deprioritized.list=` feature is bundled 
together with another feature called replica.start.offset.strategy=latest, which 
I have not filed a KIP for (the default is earliest, like current Kafka 
behavior); this is also a dynamic config and can be set at the broker level (or 
globally for the cluster). What it does: when a broker fails, loses all its 
local disk, and is replaced with an empty broker, the empty broker by default 
has to start replication from the earliest offset; for us this could be 20TB+ of 
data over a few hours and can cause outages if not throttled properly. So, just 
like the Kafka consumer, we introduced the dynamic config 
replica.start.offset.strategy=latest to replicate only from each partition 
leader's latest offset. Once the broker has caught up (URP => 0 for this 
broker), usually in 5-10 minutes or sooner, the dynamic config is removed. 
Because this broker does not have all the historical data, it should not be 
serving leaderships. That is where the KIP-491 `leader.deprioritized.list=` 
comes into play. The automation software calculates the retention time at the 
broker and topic level, takes the max, and once the broker has been replicating 
for that amount of time (e.g. 6 hours, 1 day, 3 days, whatever), the automation 
software removes the leader.deprioritized.list dynamic config for the broker and 
runs PLE to move leadership back to it.
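
For illustration, the set/remove steps of that automation could look roughly like the Admin-API sketch below. Note that leader.deprioritized.list is the config proposed in KIP-491 and discussed in this thread, not one that exists in upstream Kafka, and the broker id (1033) and bootstrap address are placeholders:

{code:java}
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class DeprioritizeReplacedBroker {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            // cluster-wide default broker config ("" entity name), so the current controller and
            // any failed-over controller both see the setting
            ConfigResource clusterDefault = new ConfigResource(ConfigResource.Type.BROKER, "");

            // step 1: before starting the empty replacement broker, deprioritize it for leadership
            AlterConfigOp deprioritize = new AlterConfigOp(
                    new ConfigEntry("leader.deprioritized.list", "1033"), AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> setOp =
                    Collections.singletonMap(clusterDefault, Collections.singletonList(deprioritize));
            admin.incrementalAlterConfigs(setOp).all().get();

            // step 2 (outside this sketch): wait until URP == 0 and the max retention window has passed

            // step 3: remove the dynamic config; a preferred leader election afterwards moves
            // leadership back to the broker
            AlterConfigOp clear = new AlterConfigOp(
                    new ConfigEntry("leader.deprioritized.list", ""), AlterConfigOp.OpType.DELETE);
            Map<ConfigResource, Collection<AlterConfigOp>> deleteOp =
                    Collections.singletonMap(clusterDefault, Collections.singletonList(clear));
            admin.incrementalAlterConfigs(deleteOp).all().get();
        }
    }
}
{code}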


> automated leader rebalance causes replication downtime for clusters with too 
> many partitions
> 
>
> Key: KAFKA-4084
> URL: https://issues.apache.org/jira/browse/KAFKA-4084
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>Reporter: Tom Crayford
>Priority: Major
>  Labels: reliability
> Fix For: 1.1.0
>
>
> If you enable {{auto.leader.rebalance.enable}} (which is on by default), and 
> you have a cluster with many partitions, there is a severe amount of 
> replication downtime following a restart. This causes 
> `UnderReplicatedPartitions` to fire, and replication is paused.
> This is because the current automated leader rebalance mechanism changes 
> leaders for *all* imbalanced partitions at once, instead of doing it 
> gradually. This effectively stops all replica fetchers in the cluster 
> (assuming there are enough imbalanced partitions), and restarts them. This 
> can take minutes on busy clusters, during which no replication is happening 
> and user data is at risk. Clients with {{acks=-1}} also see issues at this 
> time, because replication is effectively stalled.
> To quote Todd Palino from the mailing list:
> bq. There is an admin CLI command to trigger the preferred replica election 
> manually. There is also a broker configuration “auto.leader.rebalance.enable” 
> which you can set to have the broker automatically perform the PLE when 
> needed. DO NOT USE THIS OPTION. There are serious performance issues when 
> doing so, especially on larger clusters. It needs some development work that 
> has not been fully identified yet.
> This setting is extremely useful for smaller clusters, but with high 
> partition counts causes the huge issues stated above.
> One potential fix could be adding a new configuration for the number of 
> partitions to do automated leader rebalancing for at once, and *stop* once 
> that number of leader rebalances are in flight, until they're done. There may 
> be better mechanisms, and I'd love to hear if anybody has any ideas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions

2020-04-06 Thread Evan Williams (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076092#comment-17076092
 ] 

Evan Williams edited comment on KAFKA-4084 at 4/6/20, 7:35 AM:
---

[~sql_consulting] Thanks. Yes, I already have that logic scripted up which is 
good. However for me, it's much better to have 
auto.leader.rebalance.enable=true cluster wide if possible - so that it can 
take care of instance reboots, or service (planned/unplanned) restarts, 
automatically. And that's the real benefit of this patch. Being able to leave 
that enabled, yet not killing a replaced broker with traffic until UR=0 :)

 

If this turns out to work well - what are the chances that it gets merged into 
an official release?


was (Author: blodsbror):
[~sql_consulting] Thanks. Yes, I already have that logic scripted up which is 
good. However for me, it's much better to have 
auto.leader.rebalance.enable=true cluster wide if possible - so that it can 
take care of instance reboots, or service (planned/unplanned) restarts, 
automatically. And that's the real benefit of this patch. Being able to leave 
that enabled, yet not killing a replaced broker with traffic until UR=0 :)

> automated leader rebalance causes replication downtime for clusters with too 
> many partitions
> 
>
> Key: KAFKA-4084
> URL: https://issues.apache.org/jira/browse/KAFKA-4084
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>Reporter: Tom Crayford
>Priority: Major
>  Labels: reliability
> Fix For: 1.1.0
>
>
> If you enable {{auto.leader.rebalance.enable}} (which is on by default), and 
> you have a cluster with many partitions, there is a severe amount of 
> replication downtime following a restart. This causes 
> `UnderReplicatedPartitions` to fire, and replication is paused.
> This is because the current automated leader rebalance mechanism changes 
> leaders for *all* imbalanced partitions at once, instead of doing it 
> gradually. This effectively stops all replica fetchers in the cluster 
> (assuming there are enough imbalanced partitions), and restarts them. This 
> can take minutes on busy clusters, during which no replication is happening 
> and user data is at risk. Clients with {{acks=-1}} also see issues at this 
> time, because replication is effectively stalled.
> To quote Todd Palino from the mailing list:
> bq. There is an admin CLI command to trigger the preferred replica election 
> manually. There is also a broker configuration “auto.leader.rebalance.enable” 
> which you can set to have the broker automatically perform the PLE when 
> needed. DO NOT USE THIS OPTION. There are serious performance issues when 
> doing so, especially on larger clusters. It needs some development work that 
> has not been fully identified yet.
> This setting is extremely useful for smaller clusters, but with high 
> partition counts causes the huge issues stated above.
> One potential fix could be adding a new configuration for the number of 
> partitions to do automated leader rebalancing for at once, and *stop* once 
> that number of leader rebalances are in flight, until they're done. There may 
> be better mechanisms, and I'd love to hear if anybody has any ideas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions

2020-04-06 Thread Evan Williams (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076092#comment-17076092
 ] 

Evan Williams commented on KAFKA-4084:
--

[~sql_consulting] Thanks. Yes, I already have that logic scripted up which is 
good. However for me, it's much better to have 
auto.leader.rebalance.enable=true cluster wide if possible - so that it can 
take care of instance reboots, or service (planned/unplanned) restarts, 
automatically. And that's the real benefit of this patch. Being able to leave 
that enabled, yet not killing a replaced broker with traffic until UR=0 :)

> automated leader rebalance causes replication downtime for clusters with too 
> many partitions
> 
>
> Key: KAFKA-4084
> URL: https://issues.apache.org/jira/browse/KAFKA-4084
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>Reporter: Tom Crayford
>Priority: Major
>  Labels: reliability
> Fix For: 1.1.0
>
>
> If you enable {{auto.leader.rebalance.enable}} (which is on by default), and 
> you have a cluster with many partitions, there is a severe amount of 
> replication downtime following a restart. This causes 
> `UnderReplicatedPartitions` to fire, and replication is paused.
> This is because the current automated leader rebalance mechanism changes 
> leaders for *all* imbalanced partitions at once, instead of doing it 
> gradually. This effectively stops all replica fetchers in the cluster 
> (assuming there are enough imbalanced partitions), and restarts them. This 
> can take minutes on busy clusters, during which no replication is happening 
> and user data is at risk. Clients with {{acks=-1}} also see issues at this 
> time, because replication is effectively stalled.
> To quote Todd Palino from the mailing list:
> bq. There is an admin CLI command to trigger the preferred replica election 
> manually. There is also a broker configuration “auto.leader.rebalance.enable” 
> which you can set to have the broker automatically perform the PLE when 
> needed. DO NOT USE THIS OPTION. There are serious performance issues when 
> doing so, especially on larger clusters. It needs some development work that 
> has not been fully identified yet.
> This setting is extremely useful for smaller clusters, but with high 
> partition counts causes the huge issues stated above.
> One potential fix could be adding a new configuration for the number of 
> partitions to do automated leader rebalancing for at once, and *stop* once 
> that number of leader rebalances are in flight, until they're done. There may 
> be better mechanisms, and I'd love to hear if anybody has any ideas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions

2020-04-06 Thread Evan Williams (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076064#comment-17076064
 ] 

Evan Williams edited comment on KAFKA-4084 at 4/6/20, 7:29 AM:
---

Much appreciated [~sql_consulting]. I've requested access to the Google doc. I 
will implement and let you know how things go! If this turns out to work well 
- what are the chances that it gets merged into an official release?

 

Edit: Answered in the doc, thanks!

One question though - is it reasonable to still set 
auto.leader.rebalance.enable=true on all brokers, with this new functionality?


was (Author: blodsbror):
Much appreciated [~sql_consulting].  I've requested access to the Google doc. I 
will implement and let you know how things go!.

 

Edit: Answered in the doc, thanks!

One question though - is it reasonable to still set 
auto.leader.rebalance.enable=true ?, on all brokers, with this new 
functionality ?

> automated leader rebalance causes replication downtime for clusters with too 
> many partitions
> 
>
> Key: KAFKA-4084
> URL: https://issues.apache.org/jira/browse/KAFKA-4084
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>Reporter: Tom Crayford
>Priority: Major
>  Labels: reliability
> Fix For: 1.1.0
>
>
> If you enable {{auto.leader.rebalance.enable}} (which is on by default), and 
> you have a cluster with many partitions, there is a severe amount of 
> replication downtime following a restart. This causes 
> `UnderReplicatedPartitions` to fire, and replication is paused.
> This is because the current automated leader rebalance mechanism changes 
> leaders for *all* imbalanced partitions at once, instead of doing it 
> gradually. This effectively stops all replica fetchers in the cluster 
> (assuming there are enough imbalanced partitions), and restarts them. This 
> can take minutes on busy clusters, during which no replication is happening 
> and user data is at risk. Clients with {{acks=-1}} also see issues at this 
> time, because replication is effectively stalled.
> To quote Todd Palino from the mailing list:
> bq. There is an admin CLI command to trigger the preferred replica election 
> manually. There is also a broker configuration “auto.leader.rebalance.enable” 
> which you can set to have the broker automatically perform the PLE when 
> needed. DO NOT USE THIS OPTION. There are serious performance issues when 
> doing so, especially on larger clusters. It needs some development work that 
> has not been fully identified yet.
> This setting is extremely useful for smaller clusters, but with high 
> partition counts causes the huge issues stated above.
> One potential fix could be adding a new configuration for the number of 
> partitions to do automated leader rebalancing for at once, and *stop* once 
> that number of leader rebalances are in flight, until they're done. There may 
> be better mechanisms, and I'd love to hear if anybody has any ideas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions

2020-04-06 Thread GEORGE LI (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076089#comment-17076089
 ] 

GEORGE LI commented on KAFKA-4084:
--

[~blodsbror]

I think it's still ok to set auto.leader.rebalance.enable=true, except for the 
broker that failed and was replaced, coming up empty.

There should be some kind of automation that first sets the 
`leader.deprioritized.list=` dynamic config at the '' cluster-global 
level, so the current controller (and possibly another failover controller) can 
put it into effect immediately, and then starts the new replacement broker. 
While the broker is busy catching up, it has lower priority for being considered 
for leadership, so auto.leader.rebalance.enable=true is effectively disabled 
automatically for this broker.

After this broker catches up (e.g. URP => 0, CPU/disk etc. back to normal), the 
dynamic config above can be removed by the automation script, and with 
auto.leader.rebalance.enable=true the leaders will automatically go back to 
their preferred leader (the first/head of the partition assignment) on this 
broker.






> automated leader rebalance causes replication downtime for clusters with too 
> many partitions
> 
>
> Key: KAFKA-4084
> URL: https://issues.apache.org/jira/browse/KAFKA-4084
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>Reporter: Tom Crayford
>Priority: Major
>  Labels: reliability
> Fix For: 1.1.0
>
>
> If you enable {{auto.leader.rebalance.enable}} (which is on by default), and 
> you have a cluster with many partitions, there is a severe amount of 
> replication downtime following a restart. This causes 
> `UnderReplicatedPartitions` to fire, and replication is paused.
> This is because the current automated leader rebalance mechanism changes 
> leaders for *all* imbalanced partitions at once, instead of doing it 
> gradually. This effectively stops all replica fetchers in the cluster 
> (assuming there are enough imbalanced partitions), and restarts them. This 
> can take minutes on busy clusters, during which no replication is happening 
> and user data is at risk. Clients with {{acks=-1}} also see issues at this 
> time, because replication is effectively stalled.
> To quote Todd Palino from the mailing list:
> bq. There is an admin CLI command to trigger the preferred replica election 
> manually. There is also a broker configuration “auto.leader.rebalance.enable” 
> which you can set to have the broker automatically perform the PLE when 
> needed. DO NOT USE THIS OPTION. There are serious performance issues when 
> doing so, especially on larger clusters. It needs some development work that 
> has not been fully identified yet.
> This setting is extremely useful for smaller clusters, but with high 
> partition counts causes the huge issues stated above.
> One potential fix could be adding a new configuration for the number of 
> partitions to do automated leader rebalancing for at once, and *stop* once 
> that number of leader rebalances are in flight, until they're done. There may 
> be better mechanisms, and I'd love to hear if anybody has any ideas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions

2020-04-06 Thread Evan Williams (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076064#comment-17076064
 ] 

Evan Williams edited comment on KAFKA-4084 at 4/6/20, 7:24 AM:
---

Much appreciated [~sql_consulting]. I've requested access to the Google doc. I will implement it and let you know how things go!

Edit: Answered in the doc, thanks!

One question though: is it still reasonable to set auto.leader.rebalance.enable=true on all brokers with this new functionality?


was (Author: blodsbror):
Much appreciated [~sql_consulting]. I've requested access to the Google doc. I will implement it and let you know how things go! One question though: is it still reasonable to set auto.leader.rebalance.enable=true on all brokers with this new functionality?

> automated leader rebalance causes replication downtime for clusters with too 
> many partitions
> 
>
> Key: KAFKA-4084
> URL: https://issues.apache.org/jira/browse/KAFKA-4084
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>Reporter: Tom Crayford
>Priority: Major
>  Labels: reliability
> Fix For: 1.1.0
>
>
> If you enable {{auto.leader.rebalance.enable}} (which is on by default), and 
> you have a cluster with many partitions, there is a severe amount of 
> replication downtime following a restart. This causes 
> `UnderReplicatedPartitions` to fire, and replication is paused.
> This is because the current automated leader rebalance mechanism changes 
> leaders for *all* imbalanced partitions at once, instead of doing it 
> gradually. This effectively stops all replica fetchers in the cluster 
> (assuming there are enough imbalanced partitions), and restarts them. This 
> can take minutes on busy clusters, during which no replication is happening 
> and user data is at risk. Clients with {{acks=-1}} also see issues at this 
> time, because replication is effectively stalled.
> To quote Todd Palino from the mailing list:
> bq. There is an admin CLI command to trigger the preferred replica election 
> manually. There is also a broker configuration “auto.leader.rebalance.enable” 
> which you can set to have the broker automatically perform the PLE when 
> needed. DO NOT USE THIS OPTION. There are serious performance issues when 
> doing so, especially on larger clusters. It needs some development work that 
> has not been fully identified yet.
> This setting is extremely useful for smaller clusters, but with high 
> partition counts causes the huge issues stated above.
> One potential fix could be adding a new configuration for the number of 
> partitions to do automated leader rebalancing for at once, and *stop* once 
> that number of leader rebalances are in flight, until they're done. There may 
> be better mechanisms, and I'd love to hear if anybody has any ideas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions

2020-04-06 Thread GEORGE LI (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076006#comment-17076006
 ] 

GEORGE LI edited comment on KAFKA-4084 at 4/6/20, 7:13 AM:
---

[~blodsbror]

I had some free time this weekend to troubleshoot and found out that after 1.1, and at least in 2.4, the controller code has an optimization for PLE: it does not run PLE at all if current_leader == head of the replica list. That caused my unit/integration tests to fail, so I have patched that as well. I have landed my code change to my repo's feature branch 2.4-leader-deprioritized-list (based on the 2.4 branch).

Detailed installation and testing steps are in this [Google doc|https://docs.google.com/document/d/14vlPkbaog_5Xdd-HB4vMRaQQ7Fq4SlxddULsvc3PlbY/edit]. Please let me know if you have issues with the patch/testing. If you cannot view the doc, please click the request-access button, or send me your email to add to the share. My email: sqlconsult...@gmail.com

Please keep us posted with your testing results. 

Thanks,
George



was (Author: sql_consulting):
[~blodsbror]

I had some free time this weekend to troubleshoot and found out that after 1.1, and at least in 2.4, the controller code has an optimization for PLE: it does not run PLE at all if current_leader == head of the replica list. That caused my unit/integration tests to fail, so I have patched that as well. I have landed my code change to my repo's feature branch 2.4-leader-deprioritized-list (based on the 2.4 branch).

Detailed installation and testing steps are in this [Google doc|https://docs.google.com/document/d/1ZuOcYTSuCAqCut_hjI_EY3lA9W7BuIlHVUdOUcSpWww/edit]. Please let me know if you have issues with the patch/testing. If you cannot view the doc, please click the request-access button, or send me your email to add to the share. My email: sqlconsult...@gmail.com

Please keep us posted with your testing results. 

Thanks,
George


> automated leader rebalance causes replication downtime for clusters with too 
> many partitions
> 
>
> Key: KAFKA-4084
> URL: https://issues.apache.org/jira/browse/KAFKA-4084
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>Reporter: Tom Crayford
>Priority: Major
>  Labels: reliability
> Fix For: 1.1.0
>
>
> If you enable {{auto.leader.rebalance.enable}} (which is on by default), and 
> you have a cluster with many partitions, there is a severe amount of 
> replication downtime following a restart. This causes 
> `UnderReplicatedPartitions` to fire, and replication is paused.
> This is because the current automated leader rebalance mechanism changes 
> leaders for *all* imbalanced partitions at once, instead of doing it 
> gradually. This effectively stops all replica fetchers in the cluster 
> (assuming there are enough imbalanced partitions), and restarts them. This 
> can take minutes on busy clusters, during which no replication is happening 
> and user data is at risk. Clients with {{acks=-1}} also see issues at this 
> time, because replication is effectively stalled.
> To quote Todd Palino from the mailing list:
> bq. There is an admin CLI command to trigger the preferred replica election 
> manually. There is also a broker configuration “auto.leader.rebalance.enable” 
> which you can set to have the broker automatically perform the PLE when 
> needed. DO NOT USE THIS OPTION. There are serious performance issues when 
> doing so, especially on larger clusters. It needs some development work that 
> has not been fully identified yet.
> This setting is extremely useful for smaller clusters, but with high 
> partition counts causes the huge issues stated above.
> One potential fix could be adding a new configuration for the number of 
> partitions to do automated leader rebalancing for at once, and *stop* once 
> that number of leader rebalances are in flight, until they're done. There may 
> be better mechanisms, and I'd love to hear if anybody has any ideas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions

2020-04-06 Thread Evan Williams (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076064#comment-17076064
 ] 

Evan Williams commented on KAFKA-4084:
--

Much appreciated [~sql_consulting]. I've requested access to the Google doc. I will implement it and let you know how things go! One question though: is it still reasonable to set auto.leader.rebalance.enable=true on all brokers with this new functionality?

> automated leader rebalance causes replication downtime for clusters with too 
> many partitions
> 
>
> Key: KAFKA-4084
> URL: https://issues.apache.org/jira/browse/KAFKA-4084
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>Reporter: Tom Crayford
>Priority: Major
>  Labels: reliability
> Fix For: 1.1.0
>
>
> If you enable {{auto.leader.rebalance.enable}} (which is on by default), and 
> you have a cluster with many partitions, there is a severe amount of 
> replication downtime following a restart. This causes 
> `UnderReplicatedPartitions` to fire, and replication is paused.
> This is because the current automated leader rebalance mechanism changes 
> leaders for *all* imbalanced partitions at once, instead of doing it 
> gradually. This effectively stops all replica fetchers in the cluster 
> (assuming there are enough imbalanced partitions), and restarts them. This 
> can take minutes on busy clusters, during which no replication is happening 
> and user data is at risk. Clients with {{acks=-1}} also see issues at this 
> time, because replication is effectively stalled.
> To quote Todd Palino from the mailing list:
> bq. There is an admin CLI command to trigger the preferred replica election 
> manually. There is also a broker configuration “auto.leader.rebalance.enable” 
> which you can set to have the broker automatically perform the PLE when 
> needed. DO NOT USE THIS OPTION. There are serious performance issues when 
> doing so, especially on larger clusters. It needs some development work that 
> has not been fully identified yet.
> This setting is extremely useful for smaller clusters, but with high 
> partition counts causes the huge issues stated above.
> One potential fix could be adding a new configuration for the number of 
> partitions to do automated leader rebalancing for at once, and *stop* once 
> that number of leader rebalances are in flight, until they're done. There may 
> be better mechanisms, and I'd love to hear if anybody has any ideas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)