[jira] [Reopened] (KAFKA-9208) Flaky Test SslAdminClientIntegrationTest.testCreatePartitions

2019-11-19 Thread Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sophie Blee-Goldman reopened KAFKA-9208:


Note that this test is distinct from the similar flaky test 
AdminClientIntegrationTest.testCreatePartitions, and does not duplicate 
KAFKA-9069

> Flaky Test SslAdminClientIntegrationTest.testCreatePartitions
> -
>
> Key: KAFKA-9208
> URL: https://issues.apache.org/jira/browse/KAFKA-9208
> Project: Kafka
>  Issue Type: Bug
>  Components: admin
>Affects Versions: 2.4.0
>Reporter: Sophie Blee-Goldman
>Priority: Major
>  Labels: flaky-test
>
> Java 8 build failed on 2.4-targeted PR
> h3. Stacktrace
> java.lang.AssertionError: validateOnly expected:<3> but was:<1> at 
> org.junit.Assert.fail(Assert.java:89) at 
> org.junit.Assert.failNotEquals(Assert.java:835) at 
> org.junit.Assert.assertEquals(Assert.java:647) at 
> kafka.api.AdminClientIntegrationTest$$anonfun$testCreatePartitions$6.apply(AdminClientIntegrationTest.scala:625)
>  at 
> kafka.api.AdminClientIntegrationTest$$anonfun$testCreatePartitions$6.apply(AdminClientIntegrationTest.scala:599)
>  at scala.collection.immutable.List.foreach(List.scala:392) at 
> kafka.api.AdminClientIntegrationTest.testCreatePartitions(AdminClientIntegrationTest.scala:599)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 
> at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
>  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-9208) Flaky Test SslAdminClientIntegrationTest.testCreatePartitions

2019-11-19 Thread huxihx (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huxihx resolved KAFKA-9208.
---
Resolution: Duplicate

> Flaky Test SslAdminClientIntegrationTest.testCreatePartitions
> -
>
> Key: KAFKA-9208
> URL: https://issues.apache.org/jira/browse/KAFKA-9208
> Project: Kafka
>  Issue Type: Bug
>  Components: admin
>Affects Versions: 2.4.0
>Reporter: Sophie Blee-Goldman
>Priority: Major
>  Labels: flaky-test
>
> Java 8 build failed on 2.4-targeted PR
> h3. Stacktrace
> java.lang.AssertionError: validateOnly expected:<3> but was:<1> at 
> org.junit.Assert.fail(Assert.java:89) at 
> org.junit.Assert.failNotEquals(Assert.java:835) at 
> org.junit.Assert.assertEquals(Assert.java:647) at 
> kafka.api.AdminClientIntegrationTest$$anonfun$testCreatePartitions$6.apply(AdminClientIntegrationTest.scala:625)
>  at 
> kafka.api.AdminClientIntegrationTest$$anonfun$testCreatePartitions$6.apply(AdminClientIntegrationTest.scala:599)
>  at scala.collection.immutable.List.foreach(List.scala:392) at 
> kafka.api.AdminClientIntegrationTest.testCreatePartitions(AdminClientIntegrationTest.scala:599)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 
> at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
>  at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9051) Source task source offset reads can block graceful shutdown

2019-11-19 Thread Randall Hauch (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randall Hauch updated KAFKA-9051:
-
Fix Version/s: 2.3.2
   2.5.0
   2.2.3
   2.1.2
   2.0.2

Merged to the following branches:
* `trunk` for inclusion in 2.5.0
* `2.3` for inclusion in 2.3.2
* `2.2` for inclusion in 2.2.3
* `2.1` for inclusion in 2.1.2
* `2.0` for inclusion in 2.0.2

These versions are the next to be released from each of these branches.

We are currently in code freeze on the `2.4` branch for the upcoming AK 2.4.0 
release. *As this issue is not a blocker for the AK 2.4.0 release, this commit 
(da4337271) will be merged to the `2.4` branch once that code freeze has been 
lifted* and AK 2.4.0 has been released, and will be included in the subsequent 
2.4.1 release.

> Source task source offset reads can block graceful shutdown
> ---
>
> Key: KAFKA-9051
> URL: https://issues.apache.org/jira/browse/KAFKA-9051
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 1.0.2, 1.1.1, 2.0.1, 2.1.1, 2.3.0, 2.2.1, 2.4.0, 2.5.0
>Reporter: Chris Egerton
>Assignee: Chris Egerton
>Priority: Major
> Fix For: 2.0.2, 2.1.2, 2.2.3, 2.5.0, 2.3.2
>
>
> When source tasks request source offsets from the framework, this results in 
> a call to 
> [Future.get()|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/storage/OffsetStorageReaderImpl.java#L79]
>  with no timeout. In distributed workers, the future is blocked on a 
> successful [read to the 
> end|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/storage/KafkaOffsetBackingStore.java#L136]
>  of the source offsets topic, which in turn will [poll that topic 
> indefinitely|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/util/KafkaBasedLog.java#L287]
>  until the latest messages for every partition of that topic have been 
> consumed.
> This normally completes in a reasonable amount of time. However, if the 
> connectivity between the Connect worker and the Kafka cluster is degraded or 
> dropped in the middle of one of these reads, it will block until connectivity 
> is restored and the request completes successfully.
> If a task is stopped (due to a manual restart via the REST API, a rebalance, 
> worker shutdown, etc.) while blocked on a read of source offsets during its 
> {{start}} method, not only will it fail to gracefully stop, but the framework 
> [will not even invoke its stop 
> method|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSourceTask.java#L183]
>  until its {{start}} method (and, as a result, the source offset read 
> request) [has 
> completed|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSourceTask.java#L202-L206].
>  This prevents the task from being able to clean up any resources it has 
> allocated and can lead to OOM errors, excessive thread creation, and other 
> problems.
>  
> I've confirmed that this affects every release of Connect back through 1.0 at 
> least; I've tagged the most recent bug fix release of every major/minor 
> version from then on in the {{Affects Version/s}} field to avoid just putting 
> every version in that field.
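For illustration only: a minimal, hypothetical sketch (class, property, and offset key names are made up, not from this ticket) of a source task whose start() performs the offset read described above. Because the framework's reader waits on a Future with no timeout, the call can block indefinitely, and stop() is not invoked until start() returns:

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Hypothetical task, only to show where the blocking offset read happens.
public class ExampleSourceTask extends SourceTask {

    private Long lastPosition;

    @Override
    public String version() {
        return "0.0.1";
    }

    @Override
    public void start(Map<String, String> props) {
        Map<String, Object> sourcePartition =
                Collections.singletonMap("source", props.get("source.name"));
        // The framework-provided reader waits on a Future with no timeout, so this
        // call can block indefinitely if the offsets topic cannot be read to the
        // end (e.g. connectivity to the Kafka cluster is lost mid-read).
        Map<String, Object> offset = context.offsetStorageReader().offset(sourcePartition);
        lastPosition = offset == null ? null : (Long) offset.get("position");
    }

    @Override
    public List<SourceRecord> poll() {
        return Collections.emptyList();
    }

    @Override
    public void stop() {
        // Per the description above, this is not invoked until start() returns,
        // so a blocked offset read also blocks graceful shutdown and cleanup.
    }
}
{code}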



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9051) Source task source offset reads can block graceful shutdown

2019-11-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16978074#comment-16978074
 ] 

ASF GitHub Bot commented on KAFKA-9051:
---

rhauch commented on pull request #7532: KAFKA-9051: Prematurely complete source 
offset read requests for stopped tasks
URL: https://github.com/apache/kafka/pull/7532
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Source task source offset reads can block graceful shutdown
> ---
>
> Key: KAFKA-9051
> URL: https://issues.apache.org/jira/browse/KAFKA-9051
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 1.0.2, 1.1.1, 2.0.1, 2.1.1, 2.3.0, 2.2.1, 2.4.0, 2.5.0
>Reporter: Chris Egerton
>Assignee: Chris Egerton
>Priority: Major
>
> When source tasks request source offsets from the framework, this results in 
> a call to 
> [Future.get()|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/storage/OffsetStorageReaderImpl.java#L79]
>  with no timeout. In distributed workers, the future is blocked on a 
> successful [read to the 
> end|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/storage/KafkaOffsetBackingStore.java#L136]
>  of the source offsets topic, which in turn will [poll that topic 
> indefinitely|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/util/KafkaBasedLog.java#L287]
>  until the latest messages for every partition of that topic have been 
> consumed.
> This normally completes in a reasonable amount of time. However, if the 
> connectivity between the Connect worker and the Kafka cluster is degraded or 
> dropped in the middle of one of these reads, it will block until connectivity 
> is restored and the request completes successfully.
> If a task is stopped (due to a manual restart via the REST API, a rebalance, 
> worker shutdown, etc.) while blocked on a read of source offsets during its 
> {{start}} method, not only will it fail to gracefully stop, but the framework 
> [will not even invoke its stop 
> method|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSourceTask.java#L183]
>  until its {{start}} method (and, as a result, the source offset read 
> request) [has 
> completed|https://github.com/apache/kafka/blob/8966d066bd2f80c6d8f270423e7e9982097f97b9/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSourceTask.java#L202-L206].
>  This prevents the task from being able to clean up any resources it has 
> allocated and can lead to OOM errors, excessive thread creation, and other 
> problems.
>  
> I've confirmed that this affects every release of Connect back through 1.0 at 
> least; I've tagged the most recent bug fix release of every major/minor 
> version from then on in the {{Affects Version/s}} field to avoid just putting 
> every version in that field.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9214) test suite generates different errors each time

2019-11-19 Thread AK97 (Jira)
AK97 created KAFKA-9214:
---

 Summary: test suite generates different errors each time
 Key: KAFKA-9214
 URL: https://issues.apache.org/jira/browse/KAFKA-9214
 Project: Kafka
  Issue Type: Bug
Affects Versions: 2.3.0
 Environment: os: rhel:7.6
architecture: ppc64le
Reporter: AK97


I have been running the apache/kafka test suite approximately 6-7 times, and each 
execution throws a different set of errors. Some of the errors are listed below.

However, note that the same set is not seen on each run.

I would like some help understanding the cause. I am running the suite on a 
high-end VM with good connectivity.

 

Errors:

1)

org.apache.kafka.common.security.authenticator.SaslAuthenticatorTest > 
testMultipleServerMechanisms FAILED
    java.lang.AssertionError: Metric not updated 
successful-reauthentication-total expected:<0.0> but was:<1.0> expected:<0.0> 
but was:<1.0>

 

2)

kafka.admin.ResetConsumerGroupOffsetTest > testResetOffsetsExistingTopic FAILED
    java.util.concurrent.ExecutionException: 
org.apache.kafka.common.errors.CoordinatorNotAvailableException: The 
coordinator is not available

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9203) kafka-client 2.3.1 fails to consume lz4 compressed topic in kafka 2.1.1

2019-11-19 Thread Ismael Juma (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma updated KAFKA-9203:
---
Priority: Critical  (was: Major)

> kafka-client 2.3.1 fails to consume lz4 compressed topic in kafka 2.1.1
> ---
>
> Key: KAFKA-9203
> URL: https://issues.apache.org/jira/browse/KAFKA-9203
> Project: Kafka
>  Issue Type: Bug
>  Components: compression, consumer
>Affects Versions: 2.3.0, 2.3.1
>Reporter: David Watzke
>Priority: Critical
>
> I run a Kafka 2.1.1 cluster.
> When I upgraded a client app to use kafka-client 2.3.0 (or 2.3.1) instead of 
> 2.2.0, I immediately started getting the following exceptions in a loop when 
> consuming a topic with LZ4-compressed messages:
> {noformat}
> 2019-11-18 11:59:16.888 ERROR [afka-consumer-thread] 
> com.avast.utils2.kafka.consumer.ConsumerThread: Unexpected exception occurred 
> while polling and processing messages: org.apache.kafka.common.KafkaExce
> ption: Received exception when fetching the next record from 
> FILEREP_LONER_REQUEST-4. If needed, please seek past the record to continue 
> consumption. 
> org.apache.kafka.common.KafkaException: Received exception when fetching the 
> next record from FILEREP_LONER_REQUEST-4. If needed, please seek past the 
> record to continue consumption. 
>     at 
> org.apache.kafka.clients.consumer.internals.Fetcher$PartitionRecords.fetchRecords(Fetcher.java:1473)
>  
>     at 
> org.apache.kafka.clients.consumer.internals.Fetcher$PartitionRecords.access$1600(Fetcher.java:1332)
>  
>     at 
> org.apache.kafka.clients.consumer.internals.Fetcher.fetchRecords(Fetcher.java:645)
>  
>     at 
> org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:606)
>  
>     at 
> org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1263)
>  
>     at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1225) 
>     at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1201) 
>     at 
> com.avast.utils2.kafka.consumer.ConsumerThread.pollAndProcessMessagesFromKafka(ConsumerThread.java:180)
>  
>     at 
> com.avast.utils2.kafka.consumer.ConsumerThread.run(ConsumerThread.java:149) 
>     at 
> com.avast.filerep.saver.RequestSaver$.$anonfun$main$8(RequestSaver.scala:28) 
>     at 
> com.avast.filerep.saver.RequestSaver$.$anonfun$main$8$adapted(RequestSaver.scala:19)
>  
>     at 
> resource.AbstractManagedResource.$anonfun$acquireFor$1(AbstractManagedResource.scala:88)
>  
>     at 
> scala.util.control.Exception$Catch.$anonfun$either$1(Exception.scala:252) 
>     at scala.util.control.Exception$Catch.apply(Exception.scala:228) 
>     at scala.util.control.Exception$Catch.either(Exception.scala:252) 
>     at 
> resource.AbstractManagedResource.acquireFor(AbstractManagedResource.scala:88) 
>     at 
> resource.ManagedResourceOperations.apply(ManagedResourceOperations.scala:26) 
>     at 
> resource.ManagedResourceOperations.apply$(ManagedResourceOperations.scala:26) 
>     at 
> resource.AbstractManagedResource.apply(AbstractManagedResource.scala:50) 
>     at 
> resource.ManagedResourceOperations.acquireAndGet(ManagedResourceOperations.scala:25)
>  
>     at 
> resource.ManagedResourceOperations.acquireAndGet$(ManagedResourceOperations.scala:25)
>  
>     at 
> resource.AbstractManagedResource.acquireAndGet(AbstractManagedResource.scala:50)
>  
>     at 
> resource.ManagedResourceOperations.foreach(ManagedResourceOperations.scala:53)
>  
>     at 
> resource.ManagedResourceOperations.foreach$(ManagedResourceOperations.scala:53)
>  
>     at 
> resource.AbstractManagedResource.foreach(AbstractManagedResource.scala:50) 
>     at 
> com.avast.filerep.saver.RequestSaver$.$anonfun$main$6(RequestSaver.scala:19) 
>     at 
> com.avast.filerep.saver.RequestSaver$.$anonfun$main$6$adapted(RequestSaver.scala:18)
>  
>     at 
> resource.AbstractManagedResource.$anonfun$acquireFor$1(AbstractManagedResource.scala:88)
>  
>     at 
> scala.util.control.Exception$Catch.$anonfun$either$1(Exception.scala:252) 
>     at scala.util.control.Exception$Catch.apply(Exception.scala:228) 
>     at scala.util.control.Exception$Catch.either(Exception.scala:252) 
>     at 
> resource.AbstractManagedResource.acquireFor(AbstractManagedResource.scala:88) 
>     at 
> resource.ManagedResourceOperations.apply(ManagedResourceOperations.scala:26) 
>     at 
> resource.ManagedResourceOperations.apply$(ManagedResourceOperations.scala:26) 
>     at 
> resource.AbstractManagedResource.apply(AbstractManagedResource.scala:50) 
>     at 
> 

[jira] [Updated] (KAFKA-9212) Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest

2019-11-19 Thread Ismael Juma (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma updated KAFKA-9212:
---
Priority: Critical  (was: Major)

> Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest
> --
>
> Key: KAFKA-9212
> URL: https://issues.apache.org/jira/browse/KAFKA-9212
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer, offset manager
>Affects Versions: 2.3.0
> Environment: Linux
>Reporter: Yannick
>Priority: Critical
>
> When running the Kafka Connect S3 sink connector (Confluent 5.3.0), after one 
> broker got restarted (leaderEpoch updated at this point), the Connect worker 
> crashed with the following error: 
> [2019-11-19 16:20:30,097] ERROR [Worker clientId=connect-1, 
> groupId=connect-ls] Uncaught exception in herder work thread, exiting: 
> (org.apache.kafka.connect.runtime.distributed.DistributedHerder:253)
>  org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by 
> times in 30003ms
>  
> After investigation, it seems the consumer got fenced when sending 
> ListOffsetRequest in a loop and then timed out, as follows:
> [2019-11-19 16:20:30,020] DEBUG [Consumer clientId=consumer-3, 
> groupId=connect-ls] Sending ListOffsetRequest (type=ListOffsetRequest, 
> replicaId=-1, partitionTimestamps={connect_ls_config-0={timestamp: -1, 
> maxNumOffsets: 1, currentLeaderEpoch: Optional[1]}}, 
> isolationLevel=READ_UNCOMMITTED) to broker kafka6.fra2.internal:9092 (id: 4 
> rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher:905)
> [2019-11-19 16:20:30,044] DEBUG [Consumer clientId=consumer-3, 
> groupId=connect-ls] Attempt to fetch offsets for partition 
> connect_ls_config-0 failed due to FENCED_LEADER_EPOCH, retrying. 
> (org.apache.kafka.clients.consumer.internals.Fetcher:985)
>  
> The above happens multiple times until timeout.
>  
> According to the debug logs, the consumer always gets a leaderEpoch of 1 for 
> this topic when starting up:
>  
>  [2019-11-19 13:27:30,802] DEBUG [Consumer clientId=consumer-3, 
> groupId=connect-ls] Updating last seen epoch from null to 1 for partition 
> connect_ls_config-0 (org.apache.kafka.clients.Metadata:178)
>   
>   
>  But according to our broker logs, the leaderEpoch should be 2, as follows:
>   
>  [2019-11-18 14:19:28,988] INFO [Partition connect_ls_config-0 broker=4] 
> connect_ls_config-0 starts at Leader Epoch 2 from offset 22. Previous Leader 
> Epoch was: 1 (kafka.cluster.Partition)
>   
>   
>  This makes it impossible to restart the worker, as it will always get 
> fenced and then finally time out.
>   
>  It is also impossible to consume with a 2.3 kafka-console-consumer, as 
> follows:
>   
>  kafka-console-consumer --bootstrap-server BOOTSTRAPSERVER:9092 --topic 
> connect_ls_config --from-beginning 
>   
>  the above will just hang forever (which is not expected because there is 
> data), and we can see these debug messages:
> [2019-11-19 22:17:59,124] DEBUG [Consumer clientId=consumer-1, 
> groupId=console-consumer-3844] Attempt to fetch offsets for partition 
> connect_ls_config-0 failed due to FENCED_LEADER_EPOCH, retrying. 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
>   
>   
>  Interestingly, if we subscribe the same way with kafkacat (1.5.0) we can 
> consume without problems (it must be that kafkacat consumes in a somewhat 
> different way):
>   
>  kafkacat -b BOOTSTRAPSERVER:9092 -t connect_ls_config -o beginning
>   
>   
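As a rough reproduction sketch (not necessarily the exact code path the Connect 
worker takes), the offset lookup that is timing out here can be approximated 
with a plain consumer call against the affected partition. The class name and 
the 30-second timeout are illustrative; the bootstrap address and partition are 
the placeholders from the report above:

{code:java}
import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class EndOffsetsProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder address, as in the report above.
        props.put("bootstrap.servers", "BOOTSTRAPSERVER:9092");
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("connect_ls_config", 0);
            // This issues a ListOffsetRequest carrying the consumer's cached leader
            // epoch; per the description above, a stale epoch is answered with
            // FENCED_LEADER_EPOCH and retried until the call times out.
            Map<TopicPartition, Long> end =
                    consumer.endOffsets(Collections.singleton(tp), Duration.ofSeconds(30));
            System.out.println("End offset of " + tp + ": " + end.get(tp));
        }
    }
}
{code}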



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-6962) DescribeConfigsRequest Schema documentation is wrong/missing detail

2019-11-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977948#comment-16977948
 ] 

ASF GitHub Bot commented on KAFKA-6962:
---

kdrakon commented on pull request #5091: KAFKA-6962 added better docs to 
DESCRIBE_CONFIGS_REQUEST_RESOURCE_V0
URL: https://github.com/apache/kafka/pull/5091
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> DescribeConfigsRequest Schema documentation is wrong/missing detail
> ---
>
> Key: KAFKA-6962
> URL: https://issues.apache.org/jira/browse/KAFKA-6962
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 1.1.0
>Reporter: Sean Policarpio
>Priority: Minor
>
> The documentation of the following DescribeConfigsRequest Resource fields is 
> {{null}}:
>  * resource_type
>  * resource_name
>  * config_names
> -Additionally, after using the API, I've also noted that {{resource_name}} 
> should probably be listed as a nullable String since it's optional.-
> The PR attached would output something like the following:
> *Requests:*
> {{DescribeConfigs Request (Version: 0) => [resources]}}
>  {{  resources => resource_type resource_name [config_names]}}
>  {{    resource_type => INT8}}
>  {{    resource_name => STRING}}
>  {{    config_names => STRING}}
>  
> ||Field||Description||
> |resources|An array of config resources to be returned.|
> |resource_type|The resource type, which is one of 0 (UNKNOWN), 1 (ANY), 2 
> (TOPIC), 3 (GROUP), 4 (BROKER)|
> |resource_name|The resource name to query.|
> |config_names|An array of config names to retrieve. If set to null, then all 
> configs are returned for resource_name.|
>   
> {{DescribeConfigs Request (Version: 1) => [resources] include_synonyms }}
>  {{  resources => resource_type resource_name [config_names] }}
>  {{    resource_type => INT8}}
>  {{    resource_name => STRING}}
>  {{    config_names => STRING}}
>  {{  include_synonyms => BOOLEAN}}
> ||Field||Description||
> |resources|An array of config resources to be returned.|
> |resource_type|The resource type, which is one of 0 (UNKNOWN), 1 (ANY), 2 
> (TOPIC), 3 (GROUP), 4 (BROKER)|
> |resource_name|The resource name to query.|
> |config_names|An array of config names to retrieve. If set to null, then all 
> configs are returned for resource_name.|
> |include_synonyms|null|
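For context, a minimal sketch of how the fields above surface through the 
public AdminClient API (the topic name and bootstrap address are placeholders, 
not from this ticket):

{code:java}
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class DescribeTopicConfigs {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Maps to resource_type=TOPIC and resource_name="my-topic" in the
            // DescribeConfigs request; config_names is left null, so all configs
            // for the resource are returned.
            ConfigResource resource =
                    new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            Map<ConfigResource, Config> configs =
                    admin.describeConfigs(Collections.singleton(resource)).all().get();
            configs.get(resource).entries()
                    .forEach(e -> System.out.println(e.name() + " = " + e.value()));
        }
    }
}
{code}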



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9213) BufferOverflowException on rolling new segment after upgrading Kafka from 1.1.0 to 2.3.1

2019-11-19 Thread Daniyar (Jira)
Daniyar created KAFKA-9213:
--

 Summary: BufferOverflowException on rolling new segment after 
upgrading Kafka from 1.1.0 to 2.3.1
 Key: KAFKA-9213
 URL: https://issues.apache.org/jira/browse/KAFKA-9213
 Project: Kafka
  Issue Type: Bug
  Components: log
Affects Versions: 2.3.1
 Environment: Ubuntu 16.04, AWS instance d2.8xlarge.

JAVA Options:

-Xms16G 
-Xmx16G 
-XX:G1HeapRegionSize=16M 
-XX:MetaspaceSize=96m 
-XX:MinMetaspaceFreeRatio=50 
Reporter: Daniyar


We updated our Kafka cluster from version 1.1.0 to 2.3.1. We followed up to 
step 2 of the [upgrade 
instructions|https://kafka.apache.org/documentation/#upgrade].

Message format and inter-broker protocol versions were left the same:

inter.broker.protocol.version=1.1

log.message.format.version=1.1

 

After upgrading, we started to get some occasional exceptions:
{code:java}
2019/11/19 05:30:53 INFO [ProducerStateManager
partition=matchmaker_retry_clicks_15m-2] Writing producer snapshot at
offset 788532 (kafka.log.ProducerStateManager)
2019/11/19 05:30:53 INFO [Log partition=matchmaker_retry_clicks_15m-2,
dir=/mnt/kafka] Rolled new log segment at offset 788532 in 1 ms.
(kafka.log.Log)
2019/11/19 05:31:01 ERROR [ReplicaManager broker=0] Error processing append
operation on partition matchmaker_retry_clicks_15m-2
(kafka.server.ReplicaManager)
2019/11/19 05:31:01 java.nio.BufferOverflowException
2019/11/19 05:31:01 at java.nio.Buffer.nextPutIndex(Buffer.java:527)
2019/11/19 05:31:01 at
java.nio.DirectByteBuffer.putLong(DirectByteBuffer.java:797)
2019/11/19 05:31:01 at
kafka.log.TimeIndex.$anonfun$maybeAppend$1(TimeIndex.scala:134)
2019/11/19 05:31:01 at
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
2019/11/19 05:31:01 at
kafka.utils.CoreUtils$.inLock(CoreUtils.scala:253)
2019/11/19 05:31:01 at
kafka.log.TimeIndex.maybeAppend(TimeIndex.scala:114)
2019/11/19 05:31:01 at
kafka.log.LogSegment.onBecomeInactiveSegment(LogSegment.scala:520)
2019/11/19 05:31:01 at kafka.log.Log.$anonfun$roll$8(Log.scala:1690)
2019/11/19 05:31:01 at
kafka.log.Log.$anonfun$roll$8$adapted(Log.scala:1690)
2019/11/19 05:31:01 at scala.Option.foreach(Option.scala:407)
2019/11/19 05:31:01 at kafka.log.Log.$anonfun$roll$2(Log.scala:1690)
2019/11/19 05:31:01 at
kafka.log.Log.maybeHandleIOException(Log.scala:2085)
2019/11/19 05:31:01 at kafka.log.Log.roll(Log.scala:1654)
2019/11/19 05:31:01 at kafka.log.Log.maybeRoll(Log.scala:1639)
2019/11/19 05:31:01 at kafka.log.Log.$anonfun$append$2(Log.scala:966)
2019/11/19 05:31:01 at
kafka.log.Log.maybeHandleIOException(Log.scala:2085)
2019/11/19 05:31:01 at kafka.log.Log.append(Log.scala:850)
2019/11/19 05:31:01 at kafka.log.Log.appendAsLeader(Log.scala:819)
2019/11/19 05:31:01 at
kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:772)
2019/11/19 05:31:01 at
kafka.utils.CoreUtils$.inLock(CoreUtils.scala:253)
2019/11/19 05:31:01 at
kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:259)
2019/11/19 05:31:01 at
kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:759)
2019/11/19 05:31:01 at
kafka.server.ReplicaManager.$anonfun$appendToLocalLog$2(ReplicaManager.scala:763)
2019/11/19 05:31:01 at
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
2019/11/19 05:31:01 at
scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
2019/11/19 05:31:01 at
scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
2019/11/19 05:31:01 at
scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
2019/11/19 05:31:01 at
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
2019/11/19 05:31:01 at
scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
2019/11/19 05:31:01 at
scala.collection.TraversableLike.map(TraversableLike.scala:238)
2019/11/19 05:31:01 at
scala.collection.TraversableLike.map$(TraversableLike.scala:231)
2019/11/19 05:31:01 at
scala.collection.AbstractTraversable.map(Traversable.scala:108)
2019/11/19 05:31:01 at
kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:751)
2019/11/19 05:31:01 at
kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:492)
2019/11/19 05:31:01 at
kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:544)
2019/11/19 05:31:01 at
kafka.server.KafkaApis.handle(KafkaApis.scala:113)
2019/11/19 05:31:01 at
kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:69)
2019/11/19 05:31:01 at java.lang.Thread.run(Thread.java:748)

{code}
The error persists until the broker gets restarted (or leadership gets moved to 
another broker).

 

Brokers config:
{code:java}
advertised.host.name={{ hostname }}
port=9092

# Default number of partitions if a value isn't set when the topic is created.

[jira] [Commented] (KAFKA-9205) Add an option to enforce rack-aware partition reassignment

2019-11-19 Thread sats (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977902#comment-16977902
 ] 

sats commented on KAFKA-9205:
-

Cool let me dig into it, thanks. [~vahid] 

> Add an option to enforce rack-aware partition reassignment
> --
>
> Key: KAFKA-9205
> URL: https://issues.apache.org/jira/browse/KAFKA-9205
> Project: Kafka
>  Issue Type: Improvement
>  Components: admin, tools
>Reporter: Vahid Hashemian
>Priority: Minor
>
> One regularly used healing operation on Kafka clusters is replica 
> reassignment for topic partitions. For example, when there is a skew in the 
> inbound/outbound traffic of a broker, replica reassignment can be used to 
> move some leaders/followers off the broker; or if there is a skew in the disk 
> usage of brokers, replica reassignment can move some partitions to other 
> brokers that have more disk space available.
> In Kafka clusters that span across multiple data centers (or availability 
> zones), high availability is a priority; in the sense that when a data center 
> goes offline the cluster should be able to resume normal operation by 
> guaranteeing partition replicas in all data centers.
> This guarantee is currently the responsibility of the on-call engineer who 
> performs the reassignment or the tool that automatically generates the 
> reassignment plan for improving the cluster health (e.g. by considering the 
> rack configuration value of each broker in the cluster). The former is quite 
> error-prone, and the latter would lead to duplicate code in all such admin 
> tools (which are not error-free either). Not all use cases can make use of 
> the default assignment strategy used by the --generate option; and the 
> current rack-aware enforcement applies to this option only.
> It would be great for the built-in replica assignment API and tool provided 
> by Kafka to support a rack aware verification option for --execute scenario 
> that would simply return an error when [some] brokers in any replica set 
> share a common rack. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9212) Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest

2019-11-19 Thread Yannick (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yannick updated KAFKA-9212:
---
Description: 
When running Kafka connect s3 sink connector ( confluent 5.3.0), after one 
broker got restarted (leaderEpoch updated at this point), the connect worker 
crashed with the following error : 

[2019-11-19 16:20:30,097] ERROR [Worker clientId=connect-1, groupId=connect-ls] 
Uncaught exception in herder work thread, exiting: 
(org.apache.kafka.connect.runtime.distributed.DistributedHerder:253)
 org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by 
times in 30003ms

 

After investigation, it seems it's because it got fenced when sending 
ListOffsetRequest in loop and then got timed out , as follows :

[2019-11-19 16:20:30,020] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Sending ListOffsetRequest (type=ListOffsetRequest, 
replicaId=-1, partitionTimestamps={connect_ls_config-0={timestamp: -1, 
maxNumOffsets: 1, currentLeaderEpoch: Optional[1]}}, 
isolationLevel=READ_UNCOMMITTED) to broker kafka6.fra2.internal:9092 (id: 4 
rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher:905)

[2019-11-19 16:20:30,044] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Attempt to fetch offsets for partition connect_ls_config-0 
failed due to FENCED_LEADER_EPOCH, retrying. 
(org.apache.kafka.clients.consumer.internals.Fetcher:985)

 

The above happens multiple times until timeout.

 

According to the debugs, the consumer always get a leaderEpoch of 1 for this 
topic when starting up :

 
 [2019-11-19 13:27:30,802] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Updating last seen epoch from null to 1 for partition 
connect_ls_config-0 (org.apache.kafka.clients.Metadata:178)
  
  
 But according to our brokers log, the leaderEpoch should be 2, as follows :
  
 [2019-11-18 14:19:28,988] INFO [Partition connect_ls_config-0 broker=4] 
connect_ls_config-0 starts at Leader Epoch 2 from offset 22. Previous Leader 
Epoch was: 1 (kafka.cluster.Partition)
  
  
 This make impossible to restart the worker as it will always get fenced and 
then finally timeout.
  
 It is also impossible to consume with a 2.3 kafka-console-consumer as follows :
  
 kafka-console-consumer --bootstrap-server BOOTSTRAPSERVER:9092 --topic 
connect_ls_config --from-beginning 
  
 the above will just hang forever ( which is not expected cause there is data) 
and we can see those debug messages :

[2019-11-19 22:17:59,124] DEBUG [Consumer clientId=consumer-1, 
groupId=console-consumer-3844] Attempt to fetch offsets for partition 
connect_ls_config-0 failed due to FENCED_LEADER_EPOCH, retrying. 
(org.apache.kafka.clients.consumer.internals.Fetcher)
  
  
 Interesting fact, if we do subscribe the same way with kafkacat (1.5.0) we can 
consume without problem ( must be the way kafkacat is consuming which is 
different somehow):
  
 kafkacat -b BOOTSTRAPSERVER:9092 -t connect_ls_config -o beginning
  
  

  was:
When running Kafka connect s3 sink connector ( confluent 5.3.0), after one 
broker got restarted (leaderEpoch updated at this point), the connect worker 
crashed with the following error : 

[2019-11-19 16:20:30,097] ERROR [Worker clientId=connect-1, groupId=connect-ls] 
Uncaught exception in herder work thread, exiting: 
(org.apache.kafka.connect.runtime.distributed.DistributedHerder:253)
 org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by 
times in 30003ms

 

After investigation, it seems it's because it got fenced when sending 
ListOffsetRequest in loop and then got timed out , as follows :

[2019-11-19 16:20:30,020] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Sending ListOffsetRequest (type=ListOffsetRequest, 
replicaId=-1, partitionTimestamps={connect_ls_config-0={timestamp: -1, 
maxNumOffsets: 1, currentLeaderEpoch: Optional[1]}}, 
isolationLevel=READ_UNCOMMITTED) to broker kafka6.fra2.internal:9092 (id: 4 
rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher:905)

[2019-11-19 16:20:30,044] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Attempt to fetch offsets for partition connect_ls_config-0 
failed due to FENCED_LEADER_EPOCH, retrying. 
(org.apache.kafka.clients.consumer.internals.Fetcher:985)

 

The above happens multiple times until timeout.

 

According to the debugs, the consumer always get a leaderEpoch of 1 for this 
topic when starting up :

 
 [2019-11-19 13:27:30,802] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Updating last seen epoch from null to 1 for partition 
connect_ls_config-0 (org.apache.kafka.clients.Metadata:178)
  
  
 But according to our brokers log, the leaderEpoch should be 2, as follows :
  
 [2019-11-18 14:19:28,988] INFO [Partition connect_ls_config-0 broker=4] 
connect_ls_config-0 starts at Leader Epoch 2 from offset 22. Previous Leader 
Epoch was: 1 (kafka.cluster.Partition)
  
  

[jira] [Updated] (KAFKA-9212) Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest

2019-11-19 Thread Yannick (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yannick updated KAFKA-9212:
---
Description: 
When running Kafka connect s3 sink connector ( confluent 5.3.0), after one 
broker got restarted (leaderEpoch updated at this point), the connect worker 
crashed with the following error : 

[2019-11-19 16:20:30,097] ERROR [Worker clientId=connect-1, groupId=connect-ls] 
Uncaught exception in herder work thread, exiting: 
(org.apache.kafka.connect.runtime.distributed.DistributedHerder:253)
 org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by 
times in 30003ms

 

After investigation, it seems it's because it got fenced when sending 
ListOffsetRequest in loop and then got timed out , as follows :

[2019-11-19 16:20:30,020] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Sending ListOffsetRequest (type=ListOffsetRequest, 
replicaId=-1, partitionTimestamps={connect_ls_config-0={timestamp: -1, 
maxNumOffsets: 1, currentLeaderEpoch: Optional[1]}}, 
isolationLevel=READ_UNCOMMITTED) to broker kafka6.fra2.internal:9092 (id: 4 
rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher:905)

[2019-11-19 16:20:30,044] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Attempt to fetch offsets for partition connect_ls_config-0 
failed due to FENCED_LEADER_EPOCH, retrying. 
(org.apache.kafka.clients.consumer.internals.Fetcher:985)

 

This multiple times until timeout.

 

According to the debugs, the consumer always get a leaderEpoch of 1 for this 
topic when starting up :

 
 [2019-11-19 13:27:30,802] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Updating last seen epoch from null to 1 for partition 
connect_ls_config-0 (org.apache.kafka.clients.Metadata:178)
  
  
 But according to our brokers log, the leaderEpoch should be 2, as follows :
  
 [2019-11-18 14:19:28,988] INFO [Partition connect_ls_config-0 broker=4] 
connect_ls_config-0 starts at Leader Epoch 2 from offset 22. Previous Leader 
Epoch was: 1 (kafka.cluster.Partition)
  
  
 This make impossible to restart the worker as it will always get fenced and 
then finally timeout.
  
 It is also impossible to consumer with a 2.3 kafka-console-consumer as follows 
:
  
 kafka-console-consumer --bootstrap-server BOOTSTRAPSERVER:9092 --topic 
connect_ls_config --from-beginning 
  
 the above will just hang forever ( which is not expected cause there is data)
  
  
 Interesting fact, if we do subscribe the same way with kafkacat (1.5.0) we can 
consume without problem ( must be the way kafkacat is consuming which is 
different somehow):
  
 kafkacat -b BOOTSTRAPSERVER:9092 -t connect_ls_config -o beginning
  
  

  was:
When running Kafka connect s3 sink connector ( confluent 5.3.0), after one 
broker got restarted, the connect worker crashed with the following error : 

[2019-11-19 16:20:30,097] ERROR [Worker clientId=connect-1, groupId=connect-ls] 
Uncaught exception in herder work thread, exiting: 
(org.apache.kafka.connect.runtime.distributed.DistributedHerder:253)
org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by times 
in 30003ms

 

After investigation, it seems it's because it got fenced when sending 
ListOffsetRequest in loop and then got timed out , as follows :

[2019-11-19 16:20:30,020] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Sending ListOffsetRequest (type=ListOffsetRequest, 
replicaId=-1, partitionTimestamps=\{connect_ls_config-0={timestamp: -1, 
maxNumOffsets: 1, currentLeaderEpoch: Optional[1]}}, 
isolationLevel=READ_UNCOMMITTED) to broker kafka6.fra2.internal:9092 (id: 4 
rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher:905)

[2019-11-19 16:20:30,044] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Attempt to fetch offsets for partition connect_ls_config-0 
failed due to FENCED_LEADER_EPOCH, retrying. 
(org.apache.kafka.clients.consumer.internals.Fetcher:985)

 

This multiple times until timeout.

 

According to the debugs, the consumer always get a leaderEpoch of 1 for this 
topic when starting up :

 
[2019-11-19 13:27:30,802] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Updating last seen epoch from null to 1 for partition 
connect_ls_config-0 (org.apache.kafka.clients.Metadata:178)
 
 
But according to our brokers log, the leaderEpoch should be 2, as follows :
 
[2019-11-18 14:19:28,988] INFO [Partition connect_ls_config-0 broker=4] 
connect_ls_config-0 starts at Leader Epoch 2 from offset 22. Previous Leader 
Epoch was: 1 (kafka.cluster.Partition)
 
 
This make impossible to restart the worker as it will always get fenced and 
then finally timeout.
 
It is also impossible to consumer with a 2.3 kafka-console-consumer as follows :
 
kafka-console-consumer --bootstrap-server BOOTSTRAPSERVER:9092 --topic 
connect_ls_config --from-beginning 
 
the above will just hang forever ( which is not expected cause there 

[jira] [Updated] (KAFKA-9212) Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest

2019-11-19 Thread Yannick (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yannick updated KAFKA-9212:
---
Description: 
When running Kafka connect s3 sink connector ( confluent 5.3.0), after one 
broker got restarted (leaderEpoch updated at this point), the connect worker 
crashed with the following error : 

[2019-11-19 16:20:30,097] ERROR [Worker clientId=connect-1, groupId=connect-ls] 
Uncaught exception in herder work thread, exiting: 
(org.apache.kafka.connect.runtime.distributed.DistributedHerder:253)
 org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by 
times in 30003ms

 

After investigation, it seems it's because it got fenced when sending 
ListOffsetRequest in loop and then got timed out , as follows :

[2019-11-19 16:20:30,020] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Sending ListOffsetRequest (type=ListOffsetRequest, 
replicaId=-1, partitionTimestamps={connect_ls_config-0={timestamp: -1, 
maxNumOffsets: 1, currentLeaderEpoch: Optional[1]}}, 
isolationLevel=READ_UNCOMMITTED) to broker kafka6.fra2.internal:9092 (id: 4 
rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher:905)

[2019-11-19 16:20:30,044] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Attempt to fetch offsets for partition connect_ls_config-0 
failed due to FENCED_LEADER_EPOCH, retrying. 
(org.apache.kafka.clients.consumer.internals.Fetcher:985)

 

The above happens multiple times until timeout.

 

According to the debugs, the consumer always get a leaderEpoch of 1 for this 
topic when starting up :

 
 [2019-11-19 13:27:30,802] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Updating last seen epoch from null to 1 for partition 
connect_ls_config-0 (org.apache.kafka.clients.Metadata:178)
  
  
 But according to our brokers log, the leaderEpoch should be 2, as follows :
  
 [2019-11-18 14:19:28,988] INFO [Partition connect_ls_config-0 broker=4] 
connect_ls_config-0 starts at Leader Epoch 2 from offset 22. Previous Leader 
Epoch was: 1 (kafka.cluster.Partition)
  
  
 This make impossible to restart the worker as it will always get fenced and 
then finally timeout.
  
 It is also impossible to consumer with a 2.3 kafka-console-consumer as follows 
:
  
 kafka-console-consumer --bootstrap-server BOOTSTRAPSERVER:9092 --topic 
connect_ls_config --from-beginning 
  
 the above will just hang forever ( which is not expected cause there is data)
  
  
 Interesting fact, if we do subscribe the same way with kafkacat (1.5.0) we can 
consume without problem ( must be the way kafkacat is consuming which is 
different somehow):
  
 kafkacat -b BOOTSTRAPSERVER:9092 -t connect_ls_config -o beginning
  
  

  was:
When running Kafka connect s3 sink connector ( confluent 5.3.0), after one 
broker got restarted (leaderEpoch updated at this point), the connect worker 
crashed with the following error : 

[2019-11-19 16:20:30,097] ERROR [Worker clientId=connect-1, groupId=connect-ls] 
Uncaught exception in herder work thread, exiting: 
(org.apache.kafka.connect.runtime.distributed.DistributedHerder:253)
 org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by 
times in 30003ms

 

After investigation, it seems it's because it got fenced when sending 
ListOffsetRequest in loop and then got timed out , as follows :

[2019-11-19 16:20:30,020] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Sending ListOffsetRequest (type=ListOffsetRequest, 
replicaId=-1, partitionTimestamps={connect_ls_config-0={timestamp: -1, 
maxNumOffsets: 1, currentLeaderEpoch: Optional[1]}}, 
isolationLevel=READ_UNCOMMITTED) to broker kafka6.fra2.internal:9092 (id: 4 
rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher:905)

[2019-11-19 16:20:30,044] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Attempt to fetch offsets for partition connect_ls_config-0 
failed due to FENCED_LEADER_EPOCH, retrying. 
(org.apache.kafka.clients.consumer.internals.Fetcher:985)

 

This multiple times until timeout.

 

According to the debugs, the consumer always get a leaderEpoch of 1 for this 
topic when starting up :

 
 [2019-11-19 13:27:30,802] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Updating last seen epoch from null to 1 for partition 
connect_ls_config-0 (org.apache.kafka.clients.Metadata:178)
  
  
 But according to our brokers log, the leaderEpoch should be 2, as follows :
  
 [2019-11-18 14:19:28,988] INFO [Partition connect_ls_config-0 broker=4] 
connect_ls_config-0 starts at Leader Epoch 2 from offset 22. Previous Leader 
Epoch was: 1 (kafka.cluster.Partition)
  
  
 This make impossible to restart the worker as it will always get fenced and 
then finally timeout.
  
 It is also impossible to consumer with a 2.3 kafka-console-consumer as follows 
:
  
 kafka-console-consumer --bootstrap-server BOOTSTRAPSERVER:9092 --topic 
connect_ls_config --from-beginning 
  
 

[jira] [Created] (KAFKA-9212) Keep receiving FENCED_LEADER_EPOCH while sending ListOffsetRequest

2019-11-19 Thread Yannick (Jira)
Yannick created KAFKA-9212:
--

 Summary: Keep receiving FENCED_LEADER_EPOCH while sending 
ListOffsetRequest
 Key: KAFKA-9212
 URL: https://issues.apache.org/jira/browse/KAFKA-9212
 Project: Kafka
  Issue Type: Bug
  Components: consumer, offset manager
Affects Versions: 2.3.0
 Environment: Linux
Reporter: Yannick


When running Kafka connect s3 sink connector ( confluent 5.3.0), after one 
broker got restarted, the connect worker crashed with the following error : 

[2019-11-19 16:20:30,097] ERROR [Worker clientId=connect-1, groupId=connect-ls] 
Uncaught exception in herder work thread, exiting: 
(org.apache.kafka.connect.runtime.distributed.DistributedHerder:253)
org.apache.kafka.common.errors.TimeoutException: Failed to get offsets by times 
in 30003ms

 

After investigation, it seems it's because it got fenced when sending 
ListOffsetRequest in loop and then got timed out , as follows :

[2019-11-19 16:20:30,020] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Sending ListOffsetRequest (type=ListOffsetRequest, 
replicaId=-1, partitionTimestamps=\{connect_ls_config-0={timestamp: -1, 
maxNumOffsets: 1, currentLeaderEpoch: Optional[1]}}, 
isolationLevel=READ_UNCOMMITTED) to broker kafka6.fra2.internal:9092 (id: 4 
rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher:905)

[2019-11-19 16:20:30,044] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Attempt to fetch offsets for partition connect_ls_config-0 
failed due to FENCED_LEADER_EPOCH, retrying. 
(org.apache.kafka.clients.consumer.internals.Fetcher:985)

 

This multiple times until timeout.

 

According to the debugs, the consumer always get a leaderEpoch of 1 for this 
topic when starting up :

 
[2019-11-19 13:27:30,802] DEBUG [Consumer clientId=consumer-3, 
groupId=connect-ls] Updating last seen epoch from null to 1 for partition 
connect_ls_config-0 (org.apache.kafka.clients.Metadata:178)
 
 
But according to our brokers log, the leaderEpoch should be 2, as follows :
 
[2019-11-18 14:19:28,988] INFO [Partition connect_ls_config-0 broker=4] 
connect_ls_config-0 starts at Leader Epoch 2 from offset 22. Previous Leader 
Epoch was: 1 (kafka.cluster.Partition)
 
 
This make impossible to restart the worker as it will always get fenced and 
then finally timeout.
 
It is also impossible to consumer with a 2.3 kafka-console-consumer as follows :
 
kafka-console-consumer --bootstrap-server BOOTSTRAPSERVER:9092 --topic 
connect_ls_config --from-beginning 
 
the above will just hang forever ( which is not expected cause there is data)
 
 
Interesting fact, if we do subscribe the same way with kafkacat (1.5.0) we can 
consume without problem ( must be the way kafkacat is consuming which is 
different somehow):
 
kafkacat -b BOOTSTRAPSERVER:9092 -t connect_ls_config -o beginning
 
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9178) restoredPartitions is not cleared until the last restoring task completes

2019-11-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977854#comment-16977854
 ] 

ASF GitHub Bot commented on KAFKA-9178:
---

ableegoldman commented on pull request #7686: KAFKA-9178: restoredPartitions is 
not cleared until the last restoring task completes
URL: https://github.com/apache/kafka/pull/7686
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> restoredPartitions is not cleared until the last restoring task completes
> -
>
> Key: KAFKA-9178
> URL: https://issues.apache.org/jira/browse/KAFKA-9178
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Boyang Chen
>Assignee: Boyang Chen
>Priority: Blocker
>  Labels: streams
> Fix For: 2.4.0
>
>
> We check that the `active` set is empty during closeLostTasks(). However, we 
> don't currently properly clear the {{restoredPartitions}} set in some edge cases:
> We only remove partitions from {{restoredPartitions}} when a) all tasks are 
> done restoring, at which point we clear it entirely (in 
> {{AssignedStreamTasks#updateRestored}}), or b) one task at a time, when that 
> task is restoring and is closed.
> Say some partitions were still restoring while others had completed and 
> transitioned to running when a rebalance occurs. The still-restoring tasks 
> are all revoked, and closed immediately, and their partitions removed from 
> {{restoredPartitions}}. We also suspend & revoke some running tasks that have 
> finished restoring, and remove them from {{running}}/{{runningByPartition}}.
> Now we have only running tasks left, so in 
> {{TaskManager#updateNewAndRestoringTasks}} we don’t ever even call 
> {{AssignedStreamTasks#updateRestored}} and therefore we never get to clear 
> {{restoredPartitions}}. We then close each of the currently running tasks and 
> remove their partitions from everything, BUT we never got to remove or clear 
> the partitions of the running tasks that we revoked previously.
> It turns out we can't just rely on removing from {{restoredPartitions}} upon 
> completion since the partitions will just be added back to it during the next 
> loop (blocked by KAFKA-9177). For now, we should just remove partitions from 
> {{restoredPartitions}} when closing or suspending running tasks as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-7500) MirrorMaker 2.0 (KIP-382)

2019-11-19 Thread Jan Arve Sundt (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1690#comment-1690
 ] 

Jan Arve Sundt commented on KAFKA-7500:
---

I'm testing replication of all topics from one Kafka cluster to a replica Kafka 
cluster (active/passive), with the same topic names, including topic data, 
consumer offsets, and configuration settings for topics. I can see the topic 
data and consumer offsets, but I am not able to see the configuration settings 
for the topics. I also need the topics to keep the same names in the replica 
cluster. Can anyone explain what I am doing wrong?

mm2.properties settings:

connector.class = org.apache.kafka.connect.mirror.MirrorSourceConnector
clusters = A1,B2
A1.bootstrap.servers = host:port
B2.bootstrap.servers = host:port
A1->B2.enabled = true
A1->B2.topics = test-topic
rename.topics = true
sync.topic.configs = true
sync.topic.acls = false

 

 

> MirrorMaker 2.0 (KIP-382)
> -
>
> Key: KAFKA-7500
> URL: https://issues.apache.org/jira/browse/KAFKA-7500
> Project: Kafka
>  Issue Type: New Feature
>  Components: KafkaConnect, mirrormaker
>Affects Versions: 2.4.0
>Reporter: Ryanne Dolan
>Assignee: Ryanne Dolan
>Priority: Major
>  Labels: pull-request-available, ready-to-commit
> Fix For: 2.4.0
>
> Attachments: Active-Active XDCR setup.png
>
>
> Implement a drop-in replacement for MirrorMaker leveraging the Connect 
> framework.
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0]
> [https://github.com/apache/kafka/pull/6295]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-9086) Refactor Processor Node Streams Metrics

2019-11-19 Thread Bill Bejeck (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bejeck resolved KAFKA-9086.

Resolution: Fixed

> Refactor Processor Node Streams Metrics
> ---
>
> Key: KAFKA-9086
> URL: https://issues.apache.org/jira/browse/KAFKA-9086
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Bruno Cadonna
>Assignee: Bruno Cadonna
>Priority: Major
> Fix For: 2.5.0
>
>
> Refactor processor node metrics as described in KIP-444. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9086) Refactor Processor Node Streams Metrics

2019-11-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977668#comment-16977668
 ] 

ASF GitHub Bot commented on KAFKA-9086:
---

bbejeck commented on pull request #7615: KAFKA-9086: Refactor 
processor-node-level metrics
URL: https://github.com/apache/kafka/pull/7615
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Refactor Processor Node Streams Metrics
> ---
>
> Key: KAFKA-9086
> URL: https://issues.apache.org/jira/browse/KAFKA-9086
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Bruno Cadonna
>Assignee: Bruno Cadonna
>Priority: Major
> Fix For: 2.5.0
>
>
> Refactor processor node metrics as described in KIP-444. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9205) Add an option to enforce rack-aware partition reassignment

2019-11-19 Thread Vahid Hashemian (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977592#comment-16977592
 ] 

Vahid Hashemian commented on KAFKA-9205:


[~sbellapu] The KIP process is not that difficult. If you have access to the wiki 
you can easily create one and start a discussion about it on the mailing list (and 
after enough discussion/time you call a vote). The KIP page has all the necessary 
info: 
[https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals].
 You can also take some of the recent KIPs as examples.

Since there is an existing option for disabling rack-aware mode, this change 
should be designed in a way that either makes use of that option or works well 
alongside it (without causing confusion), and at the same time preserves 
backward compatibility (i.e. the existing default behavior should ideally not 
change).
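
To make the proposed check concrete, here is a minimal sketch (not existing 
tooling; the class name, the Map-based inputs and the sample broker/rack ids are 
assumptions) of a verification that flags any multi-replica partition whose 
proposed replica set sits on a single rack:

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RackAwareCheck {

    // Hypothetical helper: returns the partitions whose proposed replica set
    // does not span at least two racks. brokerRacks maps brokerId -> rack,
    // assignment maps "topic-partition" -> proposed replica list.
    public static List<String> violations(Map<Integer, String> brokerRacks,
                                          Map<String, List<Integer>> assignment) {
        List<String> flagged = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> entry : assignment.entrySet()) {
            Set<String> racks = new HashSet<>();
            for (Integer broker : entry.getValue()) {
                racks.add(brokerRacks.get(broker));
            }
            // A multi-replica partition confined to one rack loses availability
            // when that rack (data center / availability zone) goes offline.
            if (entry.getValue().size() > 1 && racks.size() < 2) {
                flagged.add(entry.getKey());
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        Map<Integer, String> brokerRacks = new HashMap<>();
        brokerRacks.put(1, "dc1");
        brokerRacks.put(2, "dc1");
        brokerRacks.put(3, "dc2");

        Map<String, List<Integer>> assignment = new HashMap<>();
        assignment.put("topicA-0", Arrays.asList(1, 2)); // both replicas in dc1 -> flagged
        assignment.put("topicA-1", Arrays.asList(1, 3)); // spans dc1 and dc2 -> ok

        System.out.println("Rack-aware violations: " + violations(brokerRacks, assignment));
    }
}
{code}

A real --execute check would read each broker's broker.rack from the cluster 
metadata rather than from a hand-built map, and might need a flag to downgrade 
the error to a warning for clusters without rack information.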

> Add an option to enforce rack-aware partition reassignment
> --
>
> Key: KAFKA-9205
> URL: https://issues.apache.org/jira/browse/KAFKA-9205
> Project: Kafka
>  Issue Type: Improvement
>  Components: admin, tools
>Reporter: Vahid Hashemian
>Priority: Minor
>
> One regularly used healing operation on Kafka clusters is replica 
> reassignment for topic partitions. For example, when there is a skew in the 
> inbound/outbound traffic of a broker, replica reassignment can be used to 
> move some leaders/followers off the broker; or if there is a skew in the disk 
> usage of brokers, replica reassignment can move some partitions to other 
> brokers that have more disk space available.
> In Kafka clusters that span multiple data centers (or availability zones), 
> high availability is a priority, in the sense that when a data center goes 
> offline the cluster should be able to resume normal operation by guaranteeing 
> partition replicas in all data centers.
> This guarantee is currently the responsibility of the on-call engineer who 
> performs the reassignment, or of the tool that automatically generates the 
> reassignment plan for improving cluster health (e.g. by considering the rack 
> configuration value of each broker in the cluster). The former is quite 
> error-prone, and the latter would lead to duplicate code in all such admin 
> tools (which are not error-free either). Not all use cases can make use of 
> the default assignment strategy used by the --generate option, and the 
> current rack-aware enforcement applies to that option only.
> It would be great for the built-in replica assignment API and tool provided 
> by Kafka to support a rack-aware verification option for the --execute 
> scenario that would simply return an error when [some] brokers in any replica 
> set share a common rack. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8803) Stream will not start due to TimeoutException: Timeout expired after 60000milliseconds while awaiting InitProducerId

2019-11-19 Thread Raman Gupta (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977577#comment-16977577
 ] 

Raman Gupta commented on KAFKA-8803:


We had this problem again today, and checked on each node restart whether the 
problem was fixed. It went away after restarting the third of four nodes.

> Stream will not start due to TimeoutException: Timeout expired after 
> 6milliseconds while awaiting InitProducerId
> 
>
> Key: KAFKA-8803
> URL: https://issues.apache.org/jira/browse/KAFKA-8803
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Raman Gupta
>Priority: Major
> Attachments: logs.txt.gz, screenshot-1.png
>
>
> One streams app is consistently failing at startup with the following 
> exception:
> {code}
> 2019-08-14 17:02:29,568 ERROR --- [2ce1b-StreamThread-2] 
> org.apa.kaf.str.pro.int.StreamTask: task [0_36] Timeout 
> exception caught when initializing transactions for task 0_36. This might 
> happen if the broker is slow to respond, if the network connection to the 
> broker was interrupted, or if similar circumstances arise. You can increase 
> producer parameter `max.block.ms` to increase this timeout.
> org.apache.kafka.common.errors.TimeoutException: Timeout expired after 
> 6milliseconds while awaiting InitProducerId
> {code}
> These same brokers are used by many other streams without any issue, 
> including some running in the very same processes as the stream that 
> consistently throws this exception.
> *UPDATE 08/16:*
> The very first instance of this error is August 13th 2019, 17:03:36.754 and 
> it happened for 4 different streams. For 3 of these streams, the error only 
> happened once, and then the stream recovered. For the 4th stream, the error 
> has continued to happen, and continues to happen now.
> I looked up the broker logs for this time and saw that at August 13th 2019, 
> 16:47:43, two of the four brokers started reporting messages like this for 
> multiple partitions:
> [2019-08-13 20:47:43,658] INFO [ReplicaFetcher replicaId=3, leaderId=1, 
> fetcherId=0] Retrying leaderEpoch request for partition xxx-1 as the leader 
> reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
> The UNKNOWN_LEADER_EPOCH messages continued for some time and then stopped; 
> here is a view of the count of these messages over time:
>  !screenshot-1.png! 
> However, as noted, the stream task timeout error continues to happen.
> I use the static consumer group protocol with Kafka 2.3.0 clients and 2.3.0 
> broker. The broker has a patch for KAFKA-8773.
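
For reference (not from this report), the knob mentioned in the error message 
can be raised on the Streams app's embedded producer via the producer prefix. 
A minimal sketch, with a placeholder application id and bootstrap servers and 
an assumed 120-second value:

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsTimeoutConfig {
    public static Properties buildProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");  // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder
        // Give the (transactional) producer more time to complete blocking calls
        // such as InitProducerId; the default max.block.ms is 60000 ms.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.MAX_BLOCK_MS_CONFIG), 120000);
        return props;
    }
}
{code}

As the error message itself notes, this only widens the timeout; it does not 
address why the broker is slow to answer InitProducerId in the first place.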



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9211) kafka upgrade 2.3.0 cause produce speed decrease

2019-11-19 Thread Ismael Juma (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977539#comment-16977539
 ] 

Ismael Juma commented on KAFKA-9211:


cc [~junrao]

> kafka upgrade 2.3.0 cause produce speed decrease
> 
>
> Key: KAFKA-9211
> URL: https://issues.apache.org/jira/browse/KAFKA-9211
> Project: Kafka
>  Issue Type: Bug
>  Components: controller, producer 
>Affects Versions: 2.3.0
>Reporter: li xiangyuan
>Priority: Critical
> Attachments: broker-jstack.txt, producer-jstack.txt
>
>
> Recently we tried to upgrade Kafka from 0.10.0.1 to 2.3.0.
> We have 15 clusters in the production environment, each with 3~6 brokers.
> We know the Kafka upgrade should be done as follows:
>       1. Replace the code with the 2.3.0 jar and restart all brokers one by one.
>       2. Unset inter.broker.protocol.version=0.10.0.1 and restart all brokers 
> one by one.
>       3. Unset log.message.format.version=0.10.0.1 and restart all brokers one 
> by one.
>  
> So far we have completed steps 1 & 2 on 12 clusters. But when we tried to 
> upgrade the remaining clusters (which had already completed step 1) with step 
> 2, we found that some topics dropped their produce rate badly.
>      We have researched this issue for a long time; since we cannot test it in 
> the production environment and cannot reproduce it in the test environment, we 
> have not found the root cause.
> All we can do now is describe the situation in as much detail as I know; I 
> hope someone can help us.
>  
> 1. Because of bug KAFKA-8653, I added the code below to the 
> handleJoinGroupRequest function in KafkaApis.scala:
> {code:java}
> if (rebalanceTimeoutMs <= 0) {
>  rebalanceTimeoutMs = joinGroupRequest.data.sessionTimeoutMs
> }{code}
> 2. One cluster that failed the upgrade has six 8C16G brokers and about 200 
> topics with 2 replicas; every broker holds 3000+ partitions and 1500+ leader 
> partitions. Most topics have a very low produce rate (less than about 
> 50 messages/sec); only one topic, with 300 partitions, produces more than 2500 
> messages/sec, and more than 20 consumer groups consume from it.
> So the whole cluster produces about 4K messages/sec, with 11 MB/sec in and 
> 240 MB/sec out, and more than 90% of the traffic comes from that one topic at 
> 2500 messages/sec.
> When we unset inter.broker.protocol.version=0.10.0.1 on 5 or 6 servers and 
> restarted, this topic's produce rate dropped to about 200 messages/sec. I 
> don't know whether the way we use Kafka could trigger any problem.
> 3. We use Kafka wrapped by spring-kafka and set the KafkaTemplate's 
> autoFlush=true, so each producer.send call also executes producer.flush 
> immediately. I know the flush method decreases produce performance 
> dramatically, but at least nothing seemed wrong before upgrade step 2; I do 
> wonder whether it has become a problem after the upgrade.
> 4. I noticed that when the produce rate decreases, consumer groups with a 
> large message lag keep consuming without any change or decrease in consume 
> rate, so I guess only the ProduceRequest rate drops, not the FetchRequest 
> rate.
> 5. We haven't set any throttle configuration, and all producers use acks=1 
> (so it is not slow broker replica fetching). When the problem is triggered, 
> CPU usage goes down on both servers and producers, and the servers' ioutil 
> stays below 30%, so it shouldn't be a hardware problem.
> 6. This is triggered almost 100% of the time once most brokers have completed 
> upgrade step 2: after an automatic leader election runs, we observe the 
> produce rate drop, and we have to downgrade the brokers (set 
> inter.broker.protocol.version=0.10.0.1) and restart them one by one before 
> things return to normal. Some clusters have to downgrade all brokers, but 
> some can leave 1 or 2 brokers without downgrading; I notice that the broker 
> that does not need the downgrade is the controller.
> 7. I have captured jstack output for the producers & servers. Although I did 
> this on a different cluster, we can see that their threads seem to be mostly 
> idle.
> 8. Both the 0.10.0.1 and the 2.3.0 kafka-client trigger this problem.
> 9. Apart from the largest topic, whose produce rate always drops, other 
> topics drop their produce rate randomly: topicA may drop in the first upgrade 
> attempt but not the next, and topicB may not drop in the first attempt but 
> drop in another attempt.
> 10. In fact, the largest cluster, which has the same topic & group usage 
> scenario described above but whose largest topic does about 12,000 (1w2) 
> messages/sec, already fails the upgrade at step 1 (just using the 2.3.0 jar).
> Any help would be appreciated, thanks. I'm very sad now...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9211) kafka upgrade 2.3.0 cause produce speed decrease

2019-11-19 Thread li xiangyuan (Jira)
li xiangyuan created KAFKA-9211:
---

 Summary: kafka upgrade 2.3.0 cause produce speed decrease
 Key: KAFKA-9211
 URL: https://issues.apache.org/jira/browse/KAFKA-9211
 Project: Kafka
  Issue Type: Bug
  Components: controller, producer 
Affects Versions: 2.3.0
Reporter: li xiangyuan
 Attachments: broker-jstack.txt, producer-jstack.txt

Recently we tried to upgrade Kafka from 0.10.0.1 to 2.3.0.

We have 15 clusters in the production environment, each with 3~6 brokers.

We know the Kafka upgrade should be done as follows:
      1. Replace the code with the 2.3.0 jar and restart all brokers one by one.
      2. Unset inter.broker.protocol.version=0.10.0.1 and restart all brokers 
one by one.
      3. Unset log.message.format.version=0.10.0.1 and restart all brokers one 
by one.
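
(Purely as an illustration of how those steps map onto broker configuration; 
the 2.3 version strings below are an assumption about the target protocol and 
message-format level, and unsetting a property simply lets the broker default 
to the latest value.)

{code}
# step 1: brokers already run the 2.3.0 jars but keep the old protocol/format
inter.broker.protocol.version=0.10.0.1
log.message.format.version=0.10.0.1

# step 2: remove (or bump) inter.broker.protocol.version, then rolling-restart
# inter.broker.protocol.version=2.3
log.message.format.version=0.10.0.1

# step 3: finally remove (or bump) log.message.format.version, then rolling-restart
# log.message.format.version=2.3
{code}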
 
So far we have completed steps 1 & 2 on 12 clusters. But when we tried to 
upgrade the remaining clusters (which had already completed step 1) with step 2, 
we found that some topics dropped their produce rate badly.
     We have researched this issue for a long time; since we cannot test it in 
the production environment and cannot reproduce it in the test environment, we 
have not found the root cause.
All we can do now is describe the situation in as much detail as I know; I hope 
someone can help us.
 
1. Because of bug KAFKA-8653, I added the code below to the 
handleJoinGroupRequest function in KafkaApis.scala:
{code:java}
if (rebalanceTimeoutMs <= 0) {
 rebalanceTimeoutMs = joinGroupRequest.data.sessionTimeoutMs
}{code}

2. One cluster that failed the upgrade has six 8C16G brokers and about 200 
topics with 2 replicas; every broker holds 3000+ partitions and 1500+ leader 
partitions. Most topics have a very low produce rate (less than about 
50 messages/sec); only one topic, with 300 partitions, produces more than 2500 
messages/sec, and more than 20 consumer groups consume from it.

So the whole cluster produces about 4K messages/sec, with 11 MB/sec in and 
240 MB/sec out, and more than 90% of the traffic comes from that one topic at 
2500 messages/sec.

When we unset inter.broker.protocol.version=0.10.0.1 on 5 or 6 servers and 
restarted, this topic's produce rate dropped to about 200 messages/sec. I don't 
know whether the way we use Kafka could trigger any problem.

3. We use Kafka wrapped by spring-kafka and set the KafkaTemplate's 
autoFlush=true, so each producer.send call also executes producer.flush 
immediately. I know the flush method decreases produce performance 
dramatically, but at least nothing seemed wrong before upgrade step 2; I do 
wonder whether it has become a problem after the upgrade.

4. I noticed that when the produce rate decreases, consumer groups with a large 
message lag keep consuming without any change or decrease in consume rate, so I 
guess only the ProduceRequest rate drops, not the FetchRequest rate.

5. We haven't set any throttle configuration, and all producers use acks=1 (so 
it is not slow broker replica fetching). When the problem is triggered, CPU 
usage goes down on both servers and producers, and the servers' ioutil stays 
below 30%, so it shouldn't be a hardware problem.

6. This is triggered almost 100% of the time once most brokers have completed 
upgrade step 2: after an automatic leader election runs, we observe the produce 
rate drop, and we have to downgrade the brokers (set 
inter.broker.protocol.version=0.10.0.1) and restart them one by one before 
things return to normal. Some clusters have to downgrade all brokers, but some 
can leave 1 or 2 brokers without downgrading; I notice that the broker that 
does not need the downgrade is the controller.

7. I have captured jstack output for the producers & servers. Although I did 
this on a different cluster, we can see that their threads seem to be mostly 
idle.

8. Both the 0.10.0.1 and the 2.3.0 kafka-client trigger this problem.

9. Apart from the largest topic, whose produce rate always drops, other topics 
drop their produce rate randomly: topicA may drop in the first upgrade attempt 
but not the next, and topicB may not drop in the first attempt but drop in 
another attempt.

10. In fact, the largest cluster, which has the same topic & group usage 
scenario described above but whose largest topic does about 12,000 (1w2) 
messages/sec, already fails the upgrade at step 1 (just using the 2.3.0 jar).


Any help would be appreciated, thanks. I'm very sad now...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9203) kafka-client 2.3.1 fails to consume lz4 compressed topic in kafka 2.1.1

2019-11-19 Thread Ismael Juma (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977528#comment-16977528
 ] 

Ismael Juma commented on KAFKA-9203:


Are you able to swap the lz4 jar when running your app with kafka 2.3 to 
verify this?
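
One way to sanity-check such a swap is to print which lz4-java jar actually 
ends up on the classpath. A minimal sketch (the class name and output labels 
are just illustrative), using lz4-java's net.jpountz.lz4.LZ4Factory entry 
point:

{code:java}
import net.jpountz.lz4.LZ4Factory;

public class Lz4VersionCheck {
    public static void main(String[] args) {
        // Where the lz4-java classes were loaded from (i.e. which jar on the classpath).
        System.out.println("lz4-java location: "
                + LZ4Factory.class.getProtectionDomain().getCodeSource().getLocation());
        // Implementation-Version from the jar manifest; may be null for shaded jars.
        System.out.println("lz4-java version: "
                + LZ4Factory.class.getPackage().getImplementationVersion());
    }
}
{code}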

> kafka-client 2.3.1 fails to consume lz4 compressed topic in kafka 2.1.1
> ---
>
> Key: KAFKA-9203
> URL: https://issues.apache.org/jira/browse/KAFKA-9203
> Project: Kafka
>  Issue Type: Bug
>  Components: compression, consumer
>Affects Versions: 2.3.0, 2.3.1
>Reporter: David Watzke
>Priority: Major
>
> I run a Kafka 2.1.1 cluster.
> When I upgraded a client app to use kafka-client 2.3.0 (or 2.3.1) instead of 
> 2.2.0, I immediately started getting the following exceptions in a loop when 
> consuming a topic with LZ4-compressed messages:
> {noformat}
> 2019-11-18 11:59:16.888 ERROR [afka-consumer-thread] 
> com.avast.utils2.kafka.consumer.ConsumerThread: Unexpected exception occurred 
> while polling and processing messages: org.apache.kafka.common.KafkaExce
> ption: Received exception when fetching the next record from 
> FILEREP_LONER_REQUEST-4. If needed, please seek past the record to continue 
> consumption. 
> org.apache.kafka.common.KafkaException: Received exception when fetching the 
> next record from FILEREP_LONER_REQUEST-4. If needed, please seek past the 
> record to continue consumption. 
>     at 
> org.apache.kafka.clients.consumer.internals.Fetcher$PartitionRecords.fetchRecords(Fetcher.java:1473)
>  
>     at 
> org.apache.kafka.clients.consumer.internals.Fetcher$PartitionRecords.access$1600(Fetcher.java:1332)
>  
>     at 
> org.apache.kafka.clients.consumer.internals.Fetcher.fetchRecords(Fetcher.java:645)
>  
>     at 
> org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:606)
>  
>     at 
> org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1263)
>  
>     at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1225) 
>     at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1201) 
>     at 
> com.avast.utils2.kafka.consumer.ConsumerThread.pollAndProcessMessagesFromKafka(ConsumerThread.java:180)
>  
>     at 
> com.avast.utils2.kafka.consumer.ConsumerThread.run(ConsumerThread.java:149) 
>     at 
> com.avast.filerep.saver.RequestSaver$.$anonfun$main$8(RequestSaver.scala:28) 
>     at 
> com.avast.filerep.saver.RequestSaver$.$anonfun$main$8$adapted(RequestSaver.scala:19)
>  
>     at 
> resource.AbstractManagedResource.$anonfun$acquireFor$1(AbstractManagedResource.scala:88)
>  
>     at 
> scala.util.control.Exception$Catch.$anonfun$either$1(Exception.scala:252) 
>     at scala.util.control.Exception$Catch.apply(Exception.scala:228) 
>     at scala.util.control.Exception$Catch.either(Exception.scala:252) 
>     at 
> resource.AbstractManagedResource.acquireFor(AbstractManagedResource.scala:88) 
>     at 
> resource.ManagedResourceOperations.apply(ManagedResourceOperations.scala:26) 
>     at 
> resource.ManagedResourceOperations.apply$(ManagedResourceOperations.scala:26) 
>     at 
> resource.AbstractManagedResource.apply(AbstractManagedResource.scala:50) 
>     at 
> resource.ManagedResourceOperations.acquireAndGet(ManagedResourceOperations.scala:25)
>  
>     at 
> resource.ManagedResourceOperations.acquireAndGet$(ManagedResourceOperations.scala:25)
>  
>     at 
> resource.AbstractManagedResource.acquireAndGet(AbstractManagedResource.scala:50)
>  
>     at 
> resource.ManagedResourceOperations.foreach(ManagedResourceOperations.scala:53)
>  
>     at 
> resource.ManagedResourceOperations.foreach$(ManagedResourceOperations.scala:53)
>  
>     at 
> resource.AbstractManagedResource.foreach(AbstractManagedResource.scala:50) 
>     at 
> com.avast.filerep.saver.RequestSaver$.$anonfun$main$6(RequestSaver.scala:19) 
>     at 
> com.avast.filerep.saver.RequestSaver$.$anonfun$main$6$adapted(RequestSaver.scala:18)
>  
>     at 
> resource.AbstractManagedResource.$anonfun$acquireFor$1(AbstractManagedResource.scala:88)
>  
>     at 
> scala.util.control.Exception$Catch.$anonfun$either$1(Exception.scala:252) 
>     at scala.util.control.Exception$Catch.apply(Exception.scala:228) 
>     at scala.util.control.Exception$Catch.either(Exception.scala:252) 
>     at 
> resource.AbstractManagedResource.acquireFor(AbstractManagedResource.scala:88) 
>     at 
> resource.ManagedResourceOperations.apply(ManagedResourceOperations.scala:26) 
>     at 
> resource.ManagedResourceOperations.apply$(ManagedResourceOperations.scala:26) 
>     at 
> 

[jira] [Comment Edited] (KAFKA-9203) kafka-client 2.3.1 fails to consume lz4 compressed topic in kafka 2.1.1

2019-11-19 Thread David Watzke (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977373#comment-16977373
 ] 

David Watzke edited comment on KAFKA-9203 at 11/19/19 10:55 AM:


producer is using kafka client 0.10.2.1

 

updated lz4-java lib does indeed sound like a very probable cause


was (Author: dwatzke):
producer is using kafka client 0.10.2.1

> kafka-client 2.3.1 fails to consume lz4 compressed topic in kafka 2.1.1
> ---
>
> Key: KAFKA-9203
> URL: https://issues.apache.org/jira/browse/KAFKA-9203
> Project: Kafka
>  Issue Type: Bug
>  Components: compression, consumer
>Affects Versions: 2.3.0, 2.3.1
>Reporter: David Watzke
>Priority: Major
>
> I run a Kafka 2.1.1 cluster.
> When I upgraded a client app to use kafka-client 2.3.0 (or 2.3.1) instead of 
> 2.2.0, I immediately started getting the following exceptions in a loop when 
> consuming a topic with LZ4-compressed messages:
> {noformat}
> 2019-11-18 11:59:16.888 ERROR [afka-consumer-thread] 
> com.avast.utils2.kafka.consumer.ConsumerThread: Unexpected exception occurred 
> while polling and processing messages: org.apache.kafka.common.KafkaExce
> ption: Received exception when fetching the next record from 
> FILEREP_LONER_REQUEST-4. If needed, please seek past the record to continue 
> consumption. 
> org.apache.kafka.common.KafkaException: Received exception when fetching the 
> next record from FILEREP_LONER_REQUEST-4. If needed, please seek past the 
> record to continue consumption. 
>     at 
> org.apache.kafka.clients.consumer.internals.Fetcher$PartitionRecords.fetchRecords(Fetcher.java:1473)
>  
>     at 
> org.apache.kafka.clients.consumer.internals.Fetcher$PartitionRecords.access$1600(Fetcher.java:1332)
>  
>     at 
> org.apache.kafka.clients.consumer.internals.Fetcher.fetchRecords(Fetcher.java:645)
>  
>     at 
> org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:606)
>  
>     at 
> org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1263)
>  
>     at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1225) 
>     at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1201) 
>     at 
> com.avast.utils2.kafka.consumer.ConsumerThread.pollAndProcessMessagesFromKafka(ConsumerThread.java:180)
>  
>     at 
> com.avast.utils2.kafka.consumer.ConsumerThread.run(ConsumerThread.java:149) 
>     at 
> com.avast.filerep.saver.RequestSaver$.$anonfun$main$8(RequestSaver.scala:28) 
>     at 
> com.avast.filerep.saver.RequestSaver$.$anonfun$main$8$adapted(RequestSaver.scala:19)
>  
>     at 
> resource.AbstractManagedResource.$anonfun$acquireFor$1(AbstractManagedResource.scala:88)
>  
>     at 
> scala.util.control.Exception$Catch.$anonfun$either$1(Exception.scala:252) 
>     at scala.util.control.Exception$Catch.apply(Exception.scala:228) 
>     at scala.util.control.Exception$Catch.either(Exception.scala:252) 
>     at 
> resource.AbstractManagedResource.acquireFor(AbstractManagedResource.scala:88) 
>     at 
> resource.ManagedResourceOperations.apply(ManagedResourceOperations.scala:26) 
>     at 
> resource.ManagedResourceOperations.apply$(ManagedResourceOperations.scala:26) 
>     at 
> resource.AbstractManagedResource.apply(AbstractManagedResource.scala:50) 
>     at 
> resource.ManagedResourceOperations.acquireAndGet(ManagedResourceOperations.scala:25)
>  
>     at 
> resource.ManagedResourceOperations.acquireAndGet$(ManagedResourceOperations.scala:25)
>  
>     at 
> resource.AbstractManagedResource.acquireAndGet(AbstractManagedResource.scala:50)
>  
>     at 
> resource.ManagedResourceOperations.foreach(ManagedResourceOperations.scala:53)
>  
>     at 
> resource.ManagedResourceOperations.foreach$(ManagedResourceOperations.scala:53)
>  
>     at 
> resource.AbstractManagedResource.foreach(AbstractManagedResource.scala:50) 
>     at 
> com.avast.filerep.saver.RequestSaver$.$anonfun$main$6(RequestSaver.scala:19) 
>     at 
> com.avast.filerep.saver.RequestSaver$.$anonfun$main$6$adapted(RequestSaver.scala:18)
>  
>     at 
> resource.AbstractManagedResource.$anonfun$acquireFor$1(AbstractManagedResource.scala:88)
>  
>     at 
> scala.util.control.Exception$Catch.$anonfun$either$1(Exception.scala:252) 
>     at scala.util.control.Exception$Catch.apply(Exception.scala:228) 
>     at scala.util.control.Exception$Catch.either(Exception.scala:252) 
>     at 
> resource.AbstractManagedResource.acquireFor(AbstractManagedResource.scala:88) 
>     at 
> resource.ManagedResourceOperations.apply(ManagedResourceOperations.scala:26) 
>     at 
> 

[jira] [Commented] (KAFKA-9203) kafka-client 2.3.1 fails to consume lz4 compressed topic in kafka 2.1.1

2019-11-19 Thread David Watzke (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977373#comment-16977373
 ] 

David Watzke commented on KAFKA-9203:
-

producer is using kafka client 0.10.2.1

> kafka-client 2.3.1 fails to consume lz4 compressed topic in kafka 2.1.1
> ---
>
> Key: KAFKA-9203
> URL: https://issues.apache.org/jira/browse/KAFKA-9203
> Project: Kafka
>  Issue Type: Bug
>  Components: compression, consumer
>Affects Versions: 2.3.0, 2.3.1
>Reporter: David Watzke
>Priority: Major
>
> I run a Kafka 2.1.1 cluster.
> When I upgraded a client app to use kafka-client 2.3.0 (or 2.3.1) instead of 
> 2.2.0, I immediately started getting the following exceptions in a loop when 
> consuming a topic with LZ4-compressed messages:
> {noformat}
> 2019-11-18 11:59:16.888 ERROR [afka-consumer-thread] 
> com.avast.utils2.kafka.consumer.ConsumerThread: Unexpected exception occurred 
> while polling and processing messages: org.apache.kafka.common.KafkaExce
> ption: Received exception when fetching the next record from 
> FILEREP_LONER_REQUEST-4. If needed, please seek past the record to continue 
> consumption. 
> org.apache.kafka.common.KafkaException: Received exception when fetching the 
> next record from FILEREP_LONER_REQUEST-4. If needed, please seek past the 
> record to continue consumption. 
>     at 
> org.apache.kafka.clients.consumer.internals.Fetcher$PartitionRecords.fetchRecords(Fetcher.java:1473)
>  
>     at 
> org.apache.kafka.clients.consumer.internals.Fetcher$PartitionRecords.access$1600(Fetcher.java:1332)
>  
>     at 
> org.apache.kafka.clients.consumer.internals.Fetcher.fetchRecords(Fetcher.java:645)
>  
>     at 
> org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:606)
>  
>     at 
> org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1263)
>  
>     at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1225) 
>     at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1201) 
>     at 
> com.avast.utils2.kafka.consumer.ConsumerThread.pollAndProcessMessagesFromKafka(ConsumerThread.java:180)
>  
>     at 
> com.avast.utils2.kafka.consumer.ConsumerThread.run(ConsumerThread.java:149) 
>     at 
> com.avast.filerep.saver.RequestSaver$.$anonfun$main$8(RequestSaver.scala:28) 
>     at 
> com.avast.filerep.saver.RequestSaver$.$anonfun$main$8$adapted(RequestSaver.scala:19)
>  
>     at 
> resource.AbstractManagedResource.$anonfun$acquireFor$1(AbstractManagedResource.scala:88)
>  
>     at 
> scala.util.control.Exception$Catch.$anonfun$either$1(Exception.scala:252) 
>     at scala.util.control.Exception$Catch.apply(Exception.scala:228) 
>     at scala.util.control.Exception$Catch.either(Exception.scala:252) 
>     at 
> resource.AbstractManagedResource.acquireFor(AbstractManagedResource.scala:88) 
>     at 
> resource.ManagedResourceOperations.apply(ManagedResourceOperations.scala:26) 
>     at 
> resource.ManagedResourceOperations.apply$(ManagedResourceOperations.scala:26) 
>     at 
> resource.AbstractManagedResource.apply(AbstractManagedResource.scala:50) 
>     at 
> resource.ManagedResourceOperations.acquireAndGet(ManagedResourceOperations.scala:25)
>  
>     at 
> resource.ManagedResourceOperations.acquireAndGet$(ManagedResourceOperations.scala:25)
>  
>     at 
> resource.AbstractManagedResource.acquireAndGet(AbstractManagedResource.scala:50)
>  
>     at 
> resource.ManagedResourceOperations.foreach(ManagedResourceOperations.scala:53)
>  
>     at 
> resource.ManagedResourceOperations.foreach$(ManagedResourceOperations.scala:53)
>  
>     at 
> resource.AbstractManagedResource.foreach(AbstractManagedResource.scala:50) 
>     at 
> com.avast.filerep.saver.RequestSaver$.$anonfun$main$6(RequestSaver.scala:19) 
>     at 
> com.avast.filerep.saver.RequestSaver$.$anonfun$main$6$adapted(RequestSaver.scala:18)
>  
>     at 
> resource.AbstractManagedResource.$anonfun$acquireFor$1(AbstractManagedResource.scala:88)
>  
>     at 
> scala.util.control.Exception$Catch.$anonfun$either$1(Exception.scala:252) 
>     at scala.util.control.Exception$Catch.apply(Exception.scala:228) 
>     at scala.util.control.Exception$Catch.either(Exception.scala:252) 
>     at 
> resource.AbstractManagedResource.acquireFor(AbstractManagedResource.scala:88) 
>     at 
> resource.ManagedResourceOperations.apply(ManagedResourceOperations.scala:26) 
>     at 
> resource.ManagedResourceOperations.apply$(ManagedResourceOperations.scala:26) 
>     at 
> resource.AbstractManagedResource.apply(AbstractManagedResource.scala:50) 
>     at 
> 

[jira] [Commented] (KAFKA-9157) logcleaner could generate empty segment files after cleaning

2019-11-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977306#comment-16977306
 ] 

ASF GitHub Bot commented on KAFKA-9157:
---

huxihx commented on pull request #7711: KAFKA-9157: Avoid generating empty 
segments if all records are deleted after cleaning
URL: https://github.com/apache/kafka/pull/7711
 
 
   https://issues.apache.org/jira/browse/KAFKA-9157
   
   If all records are deleted after cleaning, we should avoid generating empty 
log segments.
   
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> logcleaner could generate empty segment files after cleaning
> 
>
> Key: KAFKA-9157
> URL: https://issues.apache.org/jira/browse/KAFKA-9157
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 2.3.0
>Reporter: Jun Rao
>Assignee: huxihx
>Priority: Major
>
> Currently, the log cleaner could only combine segments within a 2-billion 
> offset range. If all records in that range are deleted, an empty segment 
> could be generated. It would be useful to avoid generating such empty 
> segments.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9210) kafka stream loss data

2019-11-19 Thread Matthias J. Sax (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated KAFKA-9210:
---
Component/s: streams

> kafka stream loss data
> --
>
> Key: KAFKA-9210
> URL: https://issues.apache.org/jira/browse/KAFKA-9210
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.0.1
>Reporter: panpan.liu
>Priority: Major
> Attachments: app.log, screenshot-1.png
>
>
> kafka broker: 2.0.1
> kafka stream client: 2.1.0
>  # Two applications run at the same time.
>  # After some days, I stopped one application (in k8s).
>  # The following log occurred, and when I checked the data I found that the 
> value is less than what I expected.
> {quote}Partitions 
> [flash-app-xmc-worker-share-store-minute-repartition-1]Partitions 
> [flash-app-xmc-worker-share-store-minute-repartition-1] for changelog 
> flash-app-xmc-worker-share-store-minute-changelog-12019-11-19 
> 05:50:49.816|WARN 
> |flash-client-xmc-StreamThread-3|o.a.k.s.p.i.StoreChangelogReader|101|stream-thread
>  [flash-client-xmc-StreamThread-3] Restoring StreamTasks failed. Deleting 
> StreamTasks stores to recreate from 
> scratch.org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets 
> out of range with no configured reset policy for partitions: 
> \{flash-app-xmc-worker-share-store-minute-changelog-1=6128684} at 
> org.apache.kafka.clients.consumer.internals.Fetcher.parseCompletedFetch(Fetcher.java:987)2019-11-19
>  05:50:49.817|INFO 
> |flash-client-xmc-StreamThread-3|o.a.k.s.p.i.StoreChangelogReader|105|stream-thread
>  [flash-client-xmc-StreamThread-3] Reinitializing StreamTask TaskId: 10_1 
> ProcessorTopology: KSTREAM-SOURCE-70: topics: 
> [flash-app-xmc-worker-share-store-minute-repartition] children: 
> [KSTREAM-AGGREGATE-67] KSTREAM-AGGREGATE-67: states: 
> [worker-share-store-minute] children: [KTABLE-TOSTREAM-71] 
> KTABLE-TOSTREAM-71: children: [KSTREAM-SINK-72] 
> KSTREAM-SINK-72: topic: 
> StaticTopicNameExtractor(xmc-worker-share-minute)Partitions 
> [flash-app-xmc-worker-share-store-minute-repartition-1] for changelog 
> flash-app-xmc-worker-share-store-minute-changelog-12019-11-19 
> 05:50:49.842|WARN 
> |flash-client-xmc-StreamThread-3|o.a.k.s.p.i.StoreChangelogReader|101|stream-thread
>  [flash-client-xmc-StreamThread-3] Restoring StreamTasks failed. Deleting 
> StreamTasks stores to recreate from 
> scratch.org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets 
> out of range with no configured reset policy for partitions: 
> \{flash-app-xmc-worker-share-store-minute-changelog-1=6128684} at 
> org.apache.kafka.clients.consumer.internals.Fetcher.parseCompletedFetch(Fetcher.java:987)2019-11-19
>  05:50:49.842|INFO 
> |flash-client-xmc-StreamThread-3|o.a.k.s.p.i.StoreChangelogReader|105|stream-thread
>  [flash-client-xmc-StreamThread-3] Reinitializing StreamTask TaskId: 10_1 
> ProcessorTopology: KSTREAM-SOURCE-70: topics: 
> [flash-app-xmc-worker-share-store-minute-repartition] children: 
> [KSTREAM-AGGREGATE-67] KSTREAM-AGGREGATE-67: states: 
> [worker-share-store-minute] children: [KTABLE-TOSTREAM-71] 
> KTABLE-TOSTREAM-71: children: [KSTREAM-SINK-72] 
> KSTREAM-SINK-72: topic: 
> StaticTopicNameExtractor(xmc-worker-share-minute)Partitions 
> [flash-app-xmc-worker-share-store-minute-repartition-1] for changelog 
> flash-app-xmc-worker-share-store-minute-changelog-12019-11-19 
> 05:50:49.905|WARN 
> |flash-client-xmc-StreamThread-3|o.a.k.s.p.i.StoreChangelogReader|101|stream-thread
>  [flash-client-xmc-StreamThread-3] Restoring StreamTasks failed. Deleting 
> StreamTasks stores to recreate from 
> scratch.org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets 
> out of range with no configured reset policy for partitions: 
> \{flash-app-xmc-worker-share-store-minute-changelog-1=6128684} at 
> org.apache.kafka.clients.consumer.internals.Fetcher.parseCompletedFetch(Fetcher.java:987)2019-11-19
>  05:50:49.906|INFO 
> |flash-client-xmc-StreamThread-3|o.a.k.s.p.i.StoreChangelogReader|105|stream-thread
>  [flash-client-xmc-StreamThread-3] Reinitializing StreamTask TaskId: 10_1 
> ProcessorTopology: KSTREAM-SOURCE-70: topics: 
> [flash-app-xmc-worker-share-store-minute-repartition] children: 
> [KSTREAM-AGGREGATE-67] KSTREAM-AGGREGATE-67: states: 
> [worker-share-store-minute] 
> {quote}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9210) kafka stream loss data

2019-11-19 Thread Matthias J. Sax (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated KAFKA-9210:
---
Description: 
kafka broker: 2.0.1

kafka stream client: 2.1.0
 # Two applications run at the same time.
 # After some days, I stopped one application (in k8s).
 # The following log occurred, and when I checked the data I found that the value 
is less than what I expected.

{quote}Partitions 
[flash-app-xmc-worker-share-store-minute-repartition-1]Partitions 
[flash-app-xmc-worker-share-store-minute-repartition-1] for changelog 
flash-app-xmc-worker-share-store-minute-changelog-12019-11-19 05:50:49.816|WARN 
|flash-client-xmc-StreamThread-3|o.a.k.s.p.i.StoreChangelogReader|101|stream-thread
 [flash-client-xmc-StreamThread-3] Restoring StreamTasks failed. Deleting 
StreamTasks stores to recreate from 
scratch.org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets 
out of range with no configured reset policy for partitions: 
\{flash-app-xmc-worker-share-store-minute-changelog-1=6128684} at 
org.apache.kafka.clients.consumer.internals.Fetcher.parseCompletedFetch(Fetcher.java:987)2019-11-19
 05:50:49.817|INFO 
|flash-client-xmc-StreamThread-3|o.a.k.s.p.i.StoreChangelogReader|105|stream-thread
 [flash-client-xmc-StreamThread-3] Reinitializing StreamTask TaskId: 10_1 
ProcessorTopology: KSTREAM-SOURCE-70: topics: 
[flash-app-xmc-worker-share-store-minute-repartition] children: 
[KSTREAM-AGGREGATE-67] KSTREAM-AGGREGATE-67: states: 
[worker-share-store-minute] children: [KTABLE-TOSTREAM-71] 
KTABLE-TOSTREAM-71: children: [KSTREAM-SINK-72] 
KSTREAM-SINK-72: topic: 
StaticTopicNameExtractor(xmc-worker-share-minute)Partitions 
[flash-app-xmc-worker-share-store-minute-repartition-1] for changelog 
flash-app-xmc-worker-share-store-minute-changelog-12019-11-19 05:50:49.842|WARN 
|flash-client-xmc-StreamThread-3|o.a.k.s.p.i.StoreChangelogReader|101|stream-thread
 [flash-client-xmc-StreamThread-3] Restoring StreamTasks failed. Deleting 
StreamTasks stores to recreate from 
scratch.org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets 
out of range with no configured reset policy for partitions: 
\{flash-app-xmc-worker-share-store-minute-changelog-1=6128684} at 
org.apache.kafka.clients.consumer.internals.Fetcher.parseCompletedFetch(Fetcher.java:987)2019-11-19
 05:50:49.842|INFO 
|flash-client-xmc-StreamThread-3|o.a.k.s.p.i.StoreChangelogReader|105|stream-thread
 [flash-client-xmc-StreamThread-3] Reinitializing StreamTask TaskId: 10_1 
ProcessorTopology: KSTREAM-SOURCE-70: topics: 
[flash-app-xmc-worker-share-store-minute-repartition] children: 
[KSTREAM-AGGREGATE-67] KSTREAM-AGGREGATE-67: states: 
[worker-share-store-minute] children: [KTABLE-TOSTREAM-71] 
KTABLE-TOSTREAM-71: children: [KSTREAM-SINK-72] 
KSTREAM-SINK-72: topic: 
StaticTopicNameExtractor(xmc-worker-share-minute)Partitions 
[flash-app-xmc-worker-share-store-minute-repartition-1] for changelog 
flash-app-xmc-worker-share-store-minute-changelog-12019-11-19 05:50:49.905|WARN 
|flash-client-xmc-StreamThread-3|o.a.k.s.p.i.StoreChangelogReader|101|stream-thread
 [flash-client-xmc-StreamThread-3] Restoring StreamTasks failed. Deleting 
StreamTasks stores to recreate from 
scratch.org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets 
out of range with no configured reset policy for partitions: 
\{flash-app-xmc-worker-share-store-minute-changelog-1=6128684} at 
org.apache.kafka.clients.consumer.internals.Fetcher.parseCompletedFetch(Fetcher.java:987)2019-11-19
 05:50:49.906|INFO 
|flash-client-xmc-StreamThread-3|o.a.k.s.p.i.StoreChangelogReader|105|stream-thread
 [flash-client-xmc-StreamThread-3] Reinitializing StreamTask TaskId: 10_1 
ProcessorTopology: KSTREAM-SOURCE-70: topics: 
[flash-app-xmc-worker-share-store-minute-repartition] children: 
[KSTREAM-AGGREGATE-67] KSTREAM-AGGREGATE-67: states: 
[worker-share-store-minute] 
{quote}
 

  was:
kafka broker: 2.0.1

kafka stream client: 2.1.0
 # Two applications run at the same time.
 # After some days, I stopped one application (in k8s).
 # The following log occurred, and when I checked the data I found that the value 
is less than what I expected.

{noformat}
Partitions [flash-app-xmc-worker-share-store-minute-repartition-1]Partitions 
[flash-app-xmc-worker-share-store-minute-repartition-1] for changelog 
flash-app-xmc-worker-share-store-minute-changelog-12019-11-19 05:50:49.816|WARN 
|flash-client-xmc-StreamThread-3|o.a.k.s.p.i.StoreChangelogReader|101|stream-thread
 [flash-client-xmc-StreamThread-3] Restoring StreamTasks failed. Deleting 
StreamTasks stores to recreate from 
scratch.org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets 
out of range with no configured reset policy for partitions: 

[jira] [Updated] (KAFKA-9210) kafka stream loss data

2019-11-19 Thread Matthias J. Sax (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated KAFKA-9210:
---
Description: 
kafka broker: 2.0.1

kafka stream client: 2.1.0
 # Two applications run at the same time.
 # After some days, I stopped one application (in k8s).
 # The following log occurred, and when I checked the data I found that the value 
is less than what I expected.

{noformat}
Partitions [flash-app-xmc-worker-share-store-minute-repartition-1]Partitions 
[flash-app-xmc-worker-share-store-minute-repartition-1] for changelog 
flash-app-xmc-worker-share-store-minute-changelog-12019-11-19 05:50:49.816|WARN 
|flash-client-xmc-StreamThread-3|o.a.k.s.p.i.StoreChangelogReader|101|stream-thread
 [flash-client-xmc-StreamThread-3] Restoring StreamTasks failed. Deleting 
StreamTasks stores to recreate from 
scratch.org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets 
out of range with no configured reset policy for partitions: 
{flash-app-xmc-worker-share-store-minute-changelog-1=6128684} at 
org.apache.kafka.clients.consumer.internals.Fetcher.parseCompletedFetch(Fetcher.java:987)2019-11-19
 05:50:49.817|INFO 
|flash-client-xmc-StreamThread-3|o.a.k.s.p.i.StoreChangelogReader|105|stream-thread
 [flash-client-xmc-StreamThread-3] Reinitializing StreamTask TaskId: 10_1 
ProcessorTopology: KSTREAM-SOURCE-70: topics: 
[flash-app-xmc-worker-share-store-minute-repartition] children: 
[KSTREAM-AGGREGATE-67] KSTREAM-AGGREGATE-67: states: 
[worker-share-store-minute] children: [KTABLE-TOSTREAM-71] 
KTABLE-TOSTREAM-71: children: [KSTREAM-SINK-72] 
KSTREAM-SINK-72: topic: 
StaticTopicNameExtractor(xmc-worker-share-minute)Partitions 
[flash-app-xmc-worker-share-store-minute-repartition-1] for changelog 
flash-app-xmc-worker-share-store-minute-changelog-12019-11-19 05:50:49.842|WARN 
|flash-client-xmc-StreamThread-3|o.a.k.s.p.i.StoreChangelogReader|101|stream-thread
 [flash-client-xmc-StreamThread-3] Restoring StreamTasks failed. Deleting 
StreamTasks stores to recreate from 
scratch.org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets 
out of range with no configured reset policy for partitions: 
{flash-app-xmc-worker-share-store-minute-changelog-1=6128684} at 
org.apache.kafka.clients.consumer.internals.Fetcher.parseCompletedFetch(Fetcher.java:987)2019-11-19
 05:50:49.842|INFO 
|flash-client-xmc-StreamThread-3|o.a.k.s.p.i.StoreChangelogReader|105|stream-thread
 [flash-client-xmc-StreamThread-3] Reinitializing StreamTask TaskId: 10_1 
ProcessorTopology: KSTREAM-SOURCE-70: topics: 
[flash-app-xmc-worker-share-store-minute-repartition] children: 
[KSTREAM-AGGREGATE-67] KSTREAM-AGGREGATE-67: states: 
[worker-share-store-minute] children: [KTABLE-TOSTREAM-71] 
KTABLE-TOSTREAM-71: children: [KSTREAM-SINK-72] 
KSTREAM-SINK-72: topic: 
StaticTopicNameExtractor(xmc-worker-share-minute)Partitions 
[flash-app-xmc-worker-share-store-minute-repartition-1] for changelog 
flash-app-xmc-worker-share-store-minute-changelog-12019-11-19 05:50:49.905|WARN 
|flash-client-xmc-StreamThread-3|o.a.k.s.p.i.StoreChangelogReader|101|stream-thread
 [flash-client-xmc-StreamThread-3] Restoring StreamTasks failed. Deleting 
StreamTasks stores to recreate from 
scratch.org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets 
out of range with no configured reset policy for partitions: 
{flash-app-xmc-worker-share-store-minute-changelog-1=6128684} at 
org.apache.kafka.clients.consumer.internals.Fetcher.parseCompletedFetch(Fetcher.java:987)2019-11-19
 05:50:49.906|INFO 
|flash-client-xmc-StreamThread-3|o.a.k.s.p.i.StoreChangelogReader|105|stream-thread
 [flash-client-xmc-StreamThread-3] Reinitializing StreamTask TaskId: 10_1 
ProcessorTopology: KSTREAM-SOURCE-70: topics: 
[flash-app-xmc-worker-share-store-minute-repartition] children: 
[KSTREAM-AGGREGATE-67] KSTREAM-AGGREGATE-67: states: 
[worker-share-store-minute] {noformat}
 

  was:
kafka broker: 2.0.1

kafka stream client: 2.1.0
 # Two applications run at the same time.
 # After some days, I stopped one application (in k8s).
 # The following log occurred, and when I checked the data I found that the value 
is less than what I expected.

 

```

 

 

Partitions [flash-app-xmc-worker-share-store-minute-repartition-1]Partitions 
[flash-app-xmc-worker-share-store-minute-repartition-1] for changelog 
flash-app-xmc-worker-share-store-minute-changelog-12019-11-19 05:50:49.816|WARN 
|flash-client-xmc-StreamThread-3|o.a.k.s.p.i.StoreChangelogReader|101|stream-thread
 [flash-client-xmc-StreamThread-3] Restoring StreamTasks failed. Deleting 
StreamTasks stores to recreate from 
scratch.org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets 
out of range with no configured reset policy for partitions: