[jira] [Commented] (KAFKA-9376) Plugin class loader not found using MM2

2020-01-10 Thread yzhou (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013391#comment-17013391
 ] 

yzhou commented on KAFKA-9376:
--

[~ryannedolan]

Here is a portion of the startup log:

[ org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:264 ] - [ 
INFO ] Registered loader: sun.misc.Launcher$AppClassLoader@18b4aac2
[ org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:193 ] - [ 
INFO ] Added plugin 'org.apache.kafka.connect.tools.MockConnector'
[ org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:193 ] - [ 
INFO ] Added plugin 'org.apache.kafka.connect.tools.SchemaSourceConnector'
[ org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:193 ] - [ 
INFO ] Added plugin 'org.apache.kafka.connect.mirror.MirrorSourceConnector'
[ org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:193 ] - [ 
INFO ] Added plugin 'org.apache.kafka.connect.tools.VerifiableSinkConnector'
[ org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:193 ] - [ 
INFO ] Added plugin 'org.apache.kafka.connect.mirror.MirrorCheckpointConnector'
[ org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:193 ] - [ 
INFO ] Added plugin 'org.apache.kafka.connect.mirror.MirrorHeartbeatConnector'
[ org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:193 ] - [ 
INFO ] Added plugin 'org.apache.kafka.connect.tools.VerifiableSourceConnector'
[ org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:193 ] - [ 
INFO ] Added plugin 'org.apache.kafka.connect.tools.MockSourceConnector'
[ org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:193 ] - [ 
INFO ] Added plugin 'org.apache.kafka.connect.tools.MockSinkConnector'
[ org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:193 ] - [ 
INFO ] Added plugin 'org.apache.kafka.connect.converters.LongConverter'

 

MirrorCheckpointConnector,MirrorHeartbeatConnector,MirrorSourceConnector loaded 
by AppCliassLoader(getParent()) , but they should be loaded by pluginLoader 
independently (PluginUtils.shouldLoadInIsolation(name))

> Plugin class loader not found using MM2
> ---
>
> Key: KAFKA-9376
> URL: https://issues.apache.org/jira/browse/KAFKA-9376
> Project: Kafka
>  Issue Type: Bug
>  Components: mirrormaker
>Affects Versions: 2.4.0
>Reporter: Sinóros-Szabó Péter
>Priority: Minor
>
> I am using MM2 (release 2.4.0 with scala 2.12) I geta bunch of classloader 
> errors. MM2 seems to be working, but I do not know if all of it components 
> are working as expected as this is the first time I use MM2.
> I run MM2 with the following command:
> {code:java}
> ./bin/connect-mirror-maker.sh config/connect-mirror-maker.properties
> {code}
> Errors are:
> {code:java}
> [2020-01-07 15:06:17,892] ERROR Plugin class loader for connector: 
> 'org.apache.kafka.connect.mirror.MirrorHeartbeatConnector' was not found. 
> Returning: 
> org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader@6ebf0f36 
> (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:165)
> [2020-01-07 15:06:17,889] ERROR Plugin class loader for connector: 
> 'org.apache.kafka.connect.mirror.MirrorHeartbeatConnector' was not found. 
> Returning: 
> org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader@6ebf0f36 
> (org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader:165)
> [2020-01-07 15:06:17,904] INFO ConnectorConfig values:
>  config.action.reload = restart
>  connector.class = org.apache.kafka.connect.mirror.MirrorHeartbeatConnector
>  errors.log.enable = false
>  errors.log.include.messages = false
>  errors.retry.delay.max.ms = 6
>  errors.retry.timeout = 0
>  errors.tolerance = none
>  header.converter = null
>  key.converter = null
>  name = MirrorHeartbeatConnector
>  tasks.max = 1
>  transforms = []
>  value.converter = null
>  (org.apache.kafka.connect.runtime.ConnectorConfig:347)
> [2020-01-07 15:06:17,904] INFO EnrichedConnectorConfig values:
>  config.action.reload = restart
>  connector.class = org.apache.kafka.connect.mirror.MirrorHeartbeatConnector
>  errors.log.enable = false
>  errors.log.include.messages = false
>  errors.retry.delay.max.ms = 6
>  errors.retry.timeout = 0
>  errors.tolerance = none
>  header.converter = null
>  key.converter = null
>  name = MirrorHeartbeatConnector
>  tasks.max = 1
>  transforms = []
>  value.converter = null
>  
> (org.apache.kafka.connect.runtime.ConnectorConfig$EnrichedConnectorConfig:347)
> [2020-01-07 15:06:17,905] INFO TaskConfig values:
>  task.class = class org.apache.kafka.connect.mirror.MirrorHeartbeatTask
>  (org.apache.kafka.connect.runtime.TaskConfig:347)
> [2020-01-07 15:06:17,905] INFO Instantiated task MirrorHeartbeatConnector-0 
> with version 1 of type 

[jira] [Commented] (KAFKA-8770) Either switch to or add an option for emit-on-change

2020-01-10 Thread John Roesler (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013373#comment-17013373
 ] 

John Roesler commented on KAFKA-8770:
-

Hey!

Thanks for the effort on this. I looked over your draft. For the KIP, I think 
the only question is whether this should simply become the default, or whether 
there should be a config. Or whether it should be the new default with an 
opt-out config. 

I don’t think the question of whether we should separately store a hash of the 
value vs. the value itself really needs to be discussed in a kip. That would be 
more like an implementation discussion. 

> Either switch to or add an option for emit-on-change
> 
>
> Key: KAFKA-8770
> URL: https://issues.apache.org/jira/browse/KAFKA-8770
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: John Roesler
>Priority: Major
>  Labels: needs-kip
>
> Currently, Streams offers two emission models:
> * emit-on-window-close: (using Suppression)
> * emit-on-update: (i.e., emit a new result whenever a new record is 
> processed, regardless of whether the result has changed)
> There is also an option to drop some intermediate results, either using 
> caching or suppression.
> However, there is no support for emit-on-change, in which results would be 
> forwarded only if the result has changed. This has been reported to be 
> extremely valuable as a performance optimizations for some high-traffic 
> applications, and it reduces the computational burden both internally for 
> downstream Streams operations, as well as for external systems that consume 
> the results, and currently have to deal with a lot of "no-op" changes.
> It would be pretty straightforward to implement this, by loading the prior 
> results before a stateful operation and comparing with the new result before 
> persisting or forwarding. In many cases, we load the prior result anyway, so 
> it may not be a significant performance impact either.
> One design challenge is what to do with timestamps. If we get one record at 
> time 1 that produces a result, and then another at time 2 that produces a 
> no-op, what should be the timestamp of the result, 1 or 2? emit-on-change 
> would require us to say 1.
> Clearly, we'd need to do some serious benchmarks to evaluate any potential 
> implementation of emit-on-change.
> Another design challenge is to decide if we should just automatically provide 
> emit-on-change for stateful operators, or if it should be configurable. 
> Configuration increases complexity, so unless the performance impact is high, 
> we may just want to change the emission model without a configuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KAFKA-8770) Either switch to or add an option for emit-on-change

2020-01-10 Thread Richard Yu (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013370#comment-17013370
 ] 

Richard Yu edited comment on KAFKA-8770 at 1/11/20 4:23 AM:


I have created a draft KIP for this JIRA. [~xmar]  [~mjsax] [~vvcephei] Input 
would be greatly appreciated! Right now, I've not formalized any API additions 
/ configuration changes. It would be good first if we can get some discussion 
on what is needed and what is not! 

[https://cwiki.apache.org/confluence/display/KAFKA/KIP-557%3A+Add+emit+on+change+support+for+Kafka+Streams]

 


was (Author: yohan123):
I have created a draft KIP for this JIRA. [~xmar]  [~mjsax] [~vvcephei] Input 
would be greatly appreciated! Right now, I've not formalized any API additions 
/ configuration changes. It would be good first if we can get some discussion 
on what is needed and what is not! 

[https://cwiki.apache.org/confluence/display/KAFKA/KIP-NUM%3A+Add+emit+on+change+support+for+Kafka+Streams#KIP-NUM:AddemitonchangesupportforKafkaStreams-DetailsonCoreImprovement]

 

> Either switch to or add an option for emit-on-change
> 
>
> Key: KAFKA-8770
> URL: https://issues.apache.org/jira/browse/KAFKA-8770
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: John Roesler
>Priority: Major
>  Labels: needs-kip
>
> Currently, Streams offers two emission models:
> * emit-on-window-close: (using Suppression)
> * emit-on-update: (i.e., emit a new result whenever a new record is 
> processed, regardless of whether the result has changed)
> There is also an option to drop some intermediate results, either using 
> caching or suppression.
> However, there is no support for emit-on-change, in which results would be 
> forwarded only if the result has changed. This has been reported to be 
> extremely valuable as a performance optimizations for some high-traffic 
> applications, and it reduces the computational burden both internally for 
> downstream Streams operations, as well as for external systems that consume 
> the results, and currently have to deal with a lot of "no-op" changes.
> It would be pretty straightforward to implement this, by loading the prior 
> results before a stateful operation and comparing with the new result before 
> persisting or forwarding. In many cases, we load the prior result anyway, so 
> it may not be a significant performance impact either.
> One design challenge is what to do with timestamps. If we get one record at 
> time 1 that produces a result, and then another at time 2 that produces a 
> no-op, what should be the timestamp of the result, 1 or 2? emit-on-change 
> would require us to say 1.
> Clearly, we'd need to do some serious benchmarks to evaluate any potential 
> implementation of emit-on-change.
> Another design challenge is to decide if we should just automatically provide 
> emit-on-change for stateful operators, or if it should be configurable. 
> Configuration increases complexity, so unless the performance impact is high, 
> we may just want to change the emission model without a configuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8770) Either switch to or add an option for emit-on-change

2020-01-10 Thread Richard Yu (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013370#comment-17013370
 ] 

Richard Yu commented on KAFKA-8770:
---

I have created a draft KIP for this JIRA. [~xmar]  [~mjsax] [~vvcephei] Input 
would be greatly appreciated! Right now, I've not formalized any API additions 
/ configuration changes. It would be good first if we can get some discussion 
on what is needed and what is not! 

[https://cwiki.apache.org/confluence/display/KAFKA/KIP-NUM%3A+Add+emit+on+change+support+for+Kafka+Streams#KIP-NUM:AddemitonchangesupportforKafkaStreams-DetailsonCoreImprovement]

 

> Either switch to or add an option for emit-on-change
> 
>
> Key: KAFKA-8770
> URL: https://issues.apache.org/jira/browse/KAFKA-8770
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: John Roesler
>Priority: Major
>  Labels: needs-kip
>
> Currently, Streams offers two emission models:
> * emit-on-window-close: (using Suppression)
> * emit-on-update: (i.e., emit a new result whenever a new record is 
> processed, regardless of whether the result has changed)
> There is also an option to drop some intermediate results, either using 
> caching or suppression.
> However, there is no support for emit-on-change, in which results would be 
> forwarded only if the result has changed. This has been reported to be 
> extremely valuable as a performance optimizations for some high-traffic 
> applications, and it reduces the computational burden both internally for 
> downstream Streams operations, as well as for external systems that consume 
> the results, and currently have to deal with a lot of "no-op" changes.
> It would be pretty straightforward to implement this, by loading the prior 
> results before a stateful operation and comparing with the new result before 
> persisting or forwarding. In many cases, we load the prior result anyway, so 
> it may not be a significant performance impact either.
> One design challenge is what to do with timestamps. If we get one record at 
> time 1 that produces a result, and then another at time 2 that produces a 
> no-op, what should be the timestamp of the result, 1 or 2? emit-on-change 
> would require us to say 1.
> Clearly, we'd need to do some serious benchmarks to evaluate any potential 
> implementation of emit-on-change.
> Another design challenge is to decide if we should just automatically provide 
> emit-on-change for stateful operators, or if it should be configurable. 
> Configuration increases complexity, so unless the performance impact is high, 
> we may just want to change the emission model without a configuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-7499) Extend ProductionExceptionHandler to cover serialization exceptions

2020-01-10 Thread jbfletch (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jbfletch reassigned KAFKA-7499:
---

Assignee: jbfletch  (was: Walker Carlson)

> Extend ProductionExceptionHandler to cover serialization exceptions
> ---
>
> Key: KAFKA-7499
> URL: https://issues.apache.org/jira/browse/KAFKA-7499
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Matthias J. Sax
>Assignee: jbfletch
>Priority: Major
>  Labels: beginner, kip, newbie
>
> In 
> [KIP-210|https://cwiki.apache.org/confluence/display/KAFKA/KIP-210+-+Provide+for+custom+error+handling++when+Kafka+Streams+fails+to+produce],
>  an exception handler for the write path was introduced. This exception 
> handler covers exception that are raised in the producer callback.
> However, serialization happens before the data is handed to the producer with 
> Kafka Streams itself and the producer uses `byte[]/byte[]` key-value-pair 
> types.
> Thus, we might want to extend the ProductionExceptionHandler to cover 
> serialization exception, too, to skip over corrupted output messages. An 
> example could be a "String" message that contains invalid JSON and should be 
> serialized as JSON.
> KIP-399 (not voted yet; feel free to pick it up): 
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-399%3A+Extend+ProductionExceptionHandler+to+cover+serialization+exceptions]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-2758) Improve Offset Commit Behavior

2020-01-10 Thread Guozhang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013271#comment-17013271
 ] 

Guozhang Wang commented on KAFKA-2758:
--

We used to put it on hold especially for 1) since KIP-211 is not merged yet, 
however even now after KIP-211 is merged we should be careful since a newer 
versioned client may talk to an older versioned broker (2.0-) which does not 
have KIP-211 yet. We have some plans for automatically detecting broker 
versions so I'd suggest before that we do not pick up this ticket yet.

> Improve Offset Commit Behavior
> --
>
> Key: KAFKA-2758
> URL: https://issues.apache.org/jira/browse/KAFKA-2758
> Project: Kafka
>  Issue Type: Improvement
>  Components: consumer
>Reporter: Guozhang Wang
>Priority: Major
>  Labels: newbie, reliability
>
> There are two scenarios of offset committing that we can improve:
> 1) we can filter the partitions whose committed offset is equal to the 
> consumed offset, meaning there is no new consumed messages from this 
> partition and hence we do not need to include this partition in the commit 
> request.
> 2) we can make a commit request right after resetting to a fetch / consume 
> position either according to the reset policy (e.g. on consumer starting up, 
> or handling of out of range offset, etc), or through the {code} seek {code} 
> so that if the consumer fails right after these event, upon recovery it can 
> restarts from the reset position instead of resetting again: this can lead 
> to, for example, data loss if we use "largest" as reset policy while there 
> are new messages coming to the fetching partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KAFKA-8770) Either switch to or add an option for emit-on-change

2020-01-10 Thread Richard Yu (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013230#comment-17013230
 ] 

Richard Yu edited comment on KAFKA-8770 at 1/10/20 10:09 PM:
-

[~vvcephei] Actually, I did think of something which might be very useful as a 
performance enhancement. As mentioned in the JIRA description, Kafka Streams 
would load prior results and compare them to the original. However, that 
nonetheless has potential to be a severe hit to processing speed. I propose 
that instead of loading the prior results, we just get the hash code for that 
prior result instead.

If there is a no op, the hash code of the prior result would be the same as the 
one that we have currently. However, if the result has _changed,_ then if the 
hash code function have been implemented correctly, the hash code would have 
changed correspondingly as well. Therefore, what should be done is the 
following:
 # We keep the hash codes of prior results in some store / whatever other 
device we might be able to use for storage. 
 # Whenever we obtain a new processed result,  retrieve corresponding prior 
hashcode to see if it had changed. 
 # Update store / table as necessary if the hash code has changed. 


was (Author: yohan123):
[~vvcephei] Actually, I did think of something which might be very useful as a 
performance enhancement. As mentioned in the JIRA description, Kafka Streams 
would load prior results and compare them to the original. However, that 
nonetheless has potential to be a severe hit to processing speed. I propose 
that instead of loading the prior results, we just get the hash code for that 
prior result instead.

If there is a no op, the hash code of the prior result would be the same as the 
one that we have currently. However, if the result has _changed,_ then if the 
hash code function have been implemented correctly, the hash code would have 
changed correspondingly as well. Therefore, what should be done is the 
following:
 # We keep the hash codes of prior results in some store / whatever other 
device we might be able to use for storage. 
 # Whenever we obtain a new processed result,  retrieve corresponding prior 
hashcode to see if it had changed. 
 # Update store / table as necessary if the hash code has changed.

 

 

> Either switch to or add an option for emit-on-change
> 
>
> Key: KAFKA-8770
> URL: https://issues.apache.org/jira/browse/KAFKA-8770
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: John Roesler
>Priority: Major
>  Labels: needs-kip
>
> Currently, Streams offers two emission models:
> * emit-on-window-close: (using Suppression)
> * emit-on-update: (i.e., emit a new result whenever a new record is 
> processed, regardless of whether the result has changed)
> There is also an option to drop some intermediate results, either using 
> caching or suppression.
> However, there is no support for emit-on-change, in which results would be 
> forwarded only if the result has changed. This has been reported to be 
> extremely valuable as a performance optimizations for some high-traffic 
> applications, and it reduces the computational burden both internally for 
> downstream Streams operations, as well as for external systems that consume 
> the results, and currently have to deal with a lot of "no-op" changes.
> It would be pretty straightforward to implement this, by loading the prior 
> results before a stateful operation and comparing with the new result before 
> persisting or forwarding. In many cases, we load the prior result anyway, so 
> it may not be a significant performance impact either.
> One design challenge is what to do with timestamps. If we get one record at 
> time 1 that produces a result, and then another at time 2 that produces a 
> no-op, what should be the timestamp of the result, 1 or 2? emit-on-change 
> would require us to say 1.
> Clearly, we'd need to do some serious benchmarks to evaluate any potential 
> implementation of emit-on-change.
> Another design challenge is to decide if we should just automatically provide 
> emit-on-change for stateful operators, or if it should be configurable. 
> Configuration increases complexity, so unless the performance impact is high, 
> we may just want to change the emission model without a configuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8770) Either switch to or add an option for emit-on-change

2020-01-10 Thread Richard Yu (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013230#comment-17013230
 ] 

Richard Yu commented on KAFKA-8770:
---

[~vvcephei] Actually, I did think of something which might be very useful as a 
performance enhancement. As mentioned in the JIRA description, Kafka Streams 
would load prior results and compare them to the original. However, that 
nonetheless has potential to be a severe hit to processing speed. I propose 
that instead of loading the prior results, we just get the hash code for that 
prior result instead.

If there is a no op, the hash code of the prior result would be the same as the 
one that we have currently. However, if the result has _changed,_ then if the 
hash code function have been implemented correctly, the hash code would have 
changed correspondingly as well. Therefore, what should be done is the 
following:
 # We keep the hash codes of prior results in some store / whatever other 
device we might be able to use for storage. 
 # Whenever we obtain a new processed result,  retrieve corresponding prior 
hashcode to see if it had changed. 
 # Update store / table as necessary if the hash code has changed.

 

 

> Either switch to or add an option for emit-on-change
> 
>
> Key: KAFKA-8770
> URL: https://issues.apache.org/jira/browse/KAFKA-8770
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: John Roesler
>Priority: Major
>  Labels: needs-kip
>
> Currently, Streams offers two emission models:
> * emit-on-window-close: (using Suppression)
> * emit-on-update: (i.e., emit a new result whenever a new record is 
> processed, regardless of whether the result has changed)
> There is also an option to drop some intermediate results, either using 
> caching or suppression.
> However, there is no support for emit-on-change, in which results would be 
> forwarded only if the result has changed. This has been reported to be 
> extremely valuable as a performance optimizations for some high-traffic 
> applications, and it reduces the computational burden both internally for 
> downstream Streams operations, as well as for external systems that consume 
> the results, and currently have to deal with a lot of "no-op" changes.
> It would be pretty straightforward to implement this, by loading the prior 
> results before a stateful operation and comparing with the new result before 
> persisting or forwarding. In many cases, we load the prior result anyway, so 
> it may not be a significant performance impact either.
> One design challenge is what to do with timestamps. If we get one record at 
> time 1 that produces a result, and then another at time 2 that produces a 
> no-op, what should be the timestamp of the result, 1 or 2? emit-on-change 
> would require us to say 1.
> Clearly, we'd need to do some serious benchmarks to evaluate any potential 
> implementation of emit-on-change.
> Another design challenge is to decide if we should just automatically provide 
> emit-on-change for stateful operators, or if it should be configurable. 
> Configuration increases complexity, so unless the performance impact is high, 
> we may just want to change the emission model without a configuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-6078) Investigate failure of ReassignPartitionsClusterTest.shouldExpandCluster

2020-01-10 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-6078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013208#comment-17013208
 ] 

Matthias J. Sax commented on KAFKA-6078:


[https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/172/testReport/junit/kafka.admin/ReassignPartitionsClusterTest/shouldExpandCluster/]

> Investigate failure of ReassignPartitionsClusterTest.shouldExpandCluster
> 
>
> Key: KAFKA-6078
> URL: https://issues.apache.org/jira/browse/KAFKA-6078
> Project: Kafka
>  Issue Type: Bug
>  Components: core, unit tests
>Affects Versions: 2.3.0
>Reporter: Dong Lin
>Priority: Major
>  Labels: flaky-test
> Fix For: 2.5.0
>
>
> See https://github.com/apache/kafka/pull/4084



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9399) Flaky Test BranchedMultiLevelRepartitionConnectedTopologyTest.testTopologyBuild

2020-01-10 Thread Matthias J. Sax (Jira)
Matthias J. Sax created KAFKA-9399:
--

 Summary: Flaky Test 
BranchedMultiLevelRepartitionConnectedTopologyTest.testTopologyBuild
 Key: KAFKA-9399
 URL: https://issues.apache.org/jira/browse/KAFKA-9399
 Project: Kafka
  Issue Type: Bug
  Components: streams, unit tests
Affects Versions: 2.5.0
Reporter: Matthias J. Sax


[https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/172/testReport/junit/org.apache.kafka.streams.integration/BranchedMultiLevelRepartitionConnectedTopologyTest/testTopologyBuild/]
{quote}java.lang.AssertionError: Condition not met within timeout 15000. Failed 
to observe stream transits to RUNNING at 
org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:26) at 
org.apache.kafka.test.TestUtils.lambda$waitForCondition$4(TestUtils.java:369) 
at 
org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:417) 
at 
org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:385) 
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:368) at 
org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:339) at 
org.apache.kafka.streams.integration.BranchedMultiLevelRepartitionConnectedTopologyTest.testTopologyBuild(BranchedMultiLevelRepartitionConnectedTopologyTest.java:146){quote}
STDOUT
{quote}[2020-01-10 20:54:59,190] WARN [Consumer 
clientId=branched-repartition-topic-test-7bb4acef-c1c1-401f-adb9-a67a448cae02-StreamThread-1-consumer,
 groupId=branched-repartition-topic-test] Connection to node 0 
(localhost/127.0.0.1:38720) could not be established. Broker may not be 
available. (org.apache.kafka.clients.NetworkClient:756) [2020-01-10 
20:54:59,264] WARN [Producer 
clientId=branched-repartition-topic-test-7bb4acef-c1c1-401f-adb9-a67a448cae02-StreamThread-1-producer]
 Connection to node 0 (localhost/127.0.0.1:38720) could not be established. 
Broker may not be available. (org.apache.kafka.clients.NetworkClient:756) 
[2020-01-10 20:54:59,314] WARN [AdminClient 
clientId=branched-repartition-topic-test-7bb4acef-c1c1-401f-adb9-a67a448cae02-admin]
 Connection to node 0 (localhost/127.0.0.1:38720) could not be established. 
Broker may not be available. (org.apache.kafka.clients.NetworkClient:756)
...
[2020-01-10 20:54:59,356] WARN [Consumer 
clientId=branched-repartition-topic-test-7bb4acef-c1c1-401f-adb9-a67a448cae02-StreamThread-1-restore-consumer,
 groupId=null] Connection to node -1 (localhost/127.0.0.1:38720) could not be 
established. Broker may not be available. 
(org.apache.kafka.clients.NetworkClient:756) [2020-01-10 20:54:59,356] WARN 
[Consumer 
clientId=branched-repartition-topic-test-7bb4acef-c1c1-401f-adb9-a67a448cae02-StreamThread-1-restore-consumer,
 groupId=null] Bootstrap broker localhost:38720 (id: -1 rack: null) 
disconnected (org.apache.kafka.clients.NetworkClient:1024) [2020-01-10 
20:54:59,391] WARN [Consumer 
clientId=branched-repartition-topic-test-7bb4acef-c1c1-401f-adb9-a67a448cae02-StreamThread-1-consumer,
 groupId=branched-repartition-topic-test] Connection to node 0 
(localhost/127.0.0.1:38720) could not be established. Broker may not be 
available. (org.apache.kafka.clients.NetworkClient:756) [2020-01-10 
20:54:59,457] WARN [Consumer 
clientId=branched-repartition-topic-test-7bb4acef-c1c1-401f-adb9-a67a448cae02-StreamThread-1-restore-consumer,
 groupId=null] Connection to node -1 (localhost/127.0.0.1:38720) could not be 
established. Broker may not be available. 
(org.apache.kafka.clients.NetworkClient:756) [2020-01-10 20:54:59,457] WARN 
[Consumer 
clientId=branched-repartition-topic-test-7bb4acef-c1c1-401f-adb9-a67a448cae02-StreamThread-1-restore-consumer,
 groupId=null] Bootstrap broker localhost:38720 (id: -1 rack: null) 
disconnected (org.apache.kafka.clients.NetworkClient:1024)
...
and some more of those...{quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8764) LogCleanerManager endless loop while compacting/cleaning segments

2020-01-10 Thread Tomislav Rajakovic (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013187#comment-17013187
 ] 

Tomislav Rajakovic commented on KAFKA-8764:
---

[~jnadler] for now, try to "help" LogCleaner by following solution steps from 
above. Idea is to move it's state file (cleaner-offset-checkpoint inside 
topic-partition folder) on all brokers that experiencing issue. Once when next 
LogSegment becomes available for cleaning, LogCleaner would fix himself and 
continue to work as expected (unless you'll have new record "holes", but that 
is, as [~junrao] stated, is rare event happening in edge cases).

 

Advanced solution is to checkout kafka 2.4.0 from github, apply patch attached 
to this issue, rebuild kafka and run patched version of kafka with your data, 
and issue, hopefully, should be gone. 

> LogCleanerManager endless loop while compacting/cleaning segments
> -
>
> Key: KAFKA-8764
> URL: https://issues.apache.org/jira/browse/KAFKA-8764
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Affects Versions: 2.3.0, 2.2.1, 2.4.0
> Environment: docker base image: openjdk:8-jre-alpine base image, 
> kafka from http://ftp.carnet.hr/misc/apache/kafka/2.2.1/kafka_2.12-2.2.1.tgz
>Reporter: Tomislav Rajakovic
>Priority: Major
>  Labels: patch
> Attachments: Screen Shot 2020-01-10 at 8.38.25 AM.png, 
> kafka2.4.0-KAFKA-8764.patch, kafka2.4.0-KAFKA-8764.patch, 
> log-cleaner-bug-reproduction.zip
>
>
> {{LogCleanerManager stuck in endless loop while clearing segments for one 
> partition resulting with many log outputs and heavy disk read/writes/IOPS.}}
>  
> Issue appeared on follower brokers, and it happens on every (new) broker if 
> partition assignment is changed.
>  
> Original issue setup:
>  * kafka_2.12-2.2.1 deployed as statefulset on kubernetes, 5 brokers
>  * log directory is (AWS) EBS mounted PV, gp2 (ssd) kind of 750GiB
>  * 5 zookeepers
>  * topic created with config:
>  ** name = "backup_br_domain_squad"
> partitions = 36
> replication_factor = 3
> config = {
>  "cleanup.policy" = "compact"
>  "min.compaction.lag.ms" = "8640"
>  "min.cleanable.dirty.ratio" = "0.3"
> }
>  
>  
> Log excerpt:
> {{[2019-08-07 12:10:53,895] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,895] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,031] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted time index 
> 

[jira] [Commented] (KAFKA-8764) LogCleanerManager endless loop while compacting/cleaning segments

2020-01-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013179#comment-17013179
 ] 

ASF GitHub Bot commented on KAFKA-8764:
---

trajakovic commented on pull request #7932: KAFKA-8764: LogCleanerManager 
endless loop while compacting/clea
URL: https://github.com/apache/kafka/pull/7932
 
 
   This PR fixes LogCleaner's endless loop while clearing LogSegemnts with 
holes.
   
   
   In rare cases, when clearing LogSegments with missing records, LogCleaner 
was unable to progress resulting with high CPU usage, high disk read/writes and 
excessive cleaner logs (if enabled). This PR addresses such situation by 
skipping missing record(s) and, as result, avoiding endless loop while clearing 
such Logs.
   
   
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> LogCleanerManager endless loop while compacting/cleaning segments
> -
>
> Key: KAFKA-8764
> URL: https://issues.apache.org/jira/browse/KAFKA-8764
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Affects Versions: 2.3.0, 2.2.1, 2.4.0
> Environment: docker base image: openjdk:8-jre-alpine base image, 
> kafka from http://ftp.carnet.hr/misc/apache/kafka/2.2.1/kafka_2.12-2.2.1.tgz
>Reporter: Tomislav Rajakovic
>Priority: Major
>  Labels: patch
> Attachments: Screen Shot 2020-01-10 at 8.38.25 AM.png, 
> kafka2.4.0-KAFKA-8764.patch, kafka2.4.0-KAFKA-8764.patch, 
> log-cleaner-bug-reproduction.zip
>
>
> {{LogCleanerManager stuck in endless loop while clearing segments for one 
> partition resulting with many log outputs and heavy disk read/writes/IOPS.}}
>  
> Issue appeared on follower brokers, and it happens on every (new) broker if 
> partition assignment is changed.
>  
> Original issue setup:
>  * kafka_2.12-2.2.1 deployed as statefulset on kubernetes, 5 brokers
>  * log directory is (AWS) EBS mounted PV, gp2 (ssd) kind of 750GiB
>  * 5 zookeepers
>  * topic created with config:
>  ** name = "backup_br_domain_squad"
> partitions = 36
> replication_factor = 3
> config = {
>  "cleanup.policy" = "compact"
>  "min.compaction.lag.ms" = "8640"
>  "min.cleanable.dirty.ratio" = "0.3"
> }
>  
>  
> Log excerpt:
> {{[2019-08-07 12:10:53,895] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,895] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,031] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO [Log partition=backup_br_domain_squad-14, 
> 

[jira] [Commented] (KAFKA-9253) Test failure : ReassignPartitionsClusterTest.shouldListMovingPartitionsThroughApi

2020-01-10 Thread Bill Bejeck (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013152#comment-17013152
 ] 

Bill Bejeck commented on KAFKA-9253:


Seen again in 
[https://builds.apache.org/job/kafka-pr-jdk11-scala2.13/4182/testReport/junit/kafka.admin/ReassignPartitionsClusterTest/shouldListMovingPartitionsThroughApi/]

 
{noformat}
Error 
Messagejava.lang.NullPointerExceptionStacktracejava.lang.NullPointerException
at 
kafka.admin.ReassignPartitionsClusterTest.assertIsReassigning(ReassignPartitionsClusterTest.scala:1190)
at 
kafka.admin.ReassignPartitionsClusterTest.shouldListMovingPartitionsThroughApi(ReassignPartitionsClusterTest.scala:811)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at 
org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:110)
at 
org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
at 
org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38)
at 
org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:62)
at 
org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
at jdk.internal.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at 
org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
at 
org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
at 
org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:33)
at 
org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:94)
at com.sun.proxy.$Proxy2.processTestClass(Unknown Source)
at 
org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:118)
at jdk.internal.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at 
org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
at 
org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
at 
org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObjectConnection.java:182)
at 

[jira] [Commented] (KAFKA-9397) Deprecate Direct Zookeeper access in Kafka Administrative Tools

2020-01-10 Thread Jira


[ 
https://issues.apache.org/jira/browse/KAFKA-9397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013151#comment-17013151
 ] 

Gérald Quintana commented on KAFKA-9397:


kafka-acl.sh doesn''t have --zookeeper argument, but --authorizer-property 
zookeeper.connect is the same at the end

> Deprecate Direct Zookeeper access in Kafka Administrative Tools
> ---
>
> Key: KAFKA-9397
> URL: https://issues.apache.org/jira/browse/KAFKA-9397
> Project: Kafka
>  Issue Type: Improvement
>  Components: tools
>Affects Versions: 2.5.0
>Reporter: Colin McCabe
>Assignee: Colin McCabe
>Priority: Major
> Fix For: 2.5.0
>
>
> KIP-555: Deprecate Direct Zookeeper access in Kafka Administrative Tools



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9393) DeleteRecords may cause extreme lock contention for large partition directories

2020-01-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013150#comment-17013150
 ] 

ASF GitHub Bot commented on KAFKA-9393:
---

gardnervickers commented on pull request #7929: KAFKA-9393: Establish a 1:1 
mapping between producer state snapshot files and segment files.
URL: https://github.com/apache/kafka/pull/7929
 
 
   https://issues.apache.org/jira/browse/KAFKA-9393
   This PR avoids a performance issue with `DeleteRecords` when a partition 
directory contains high numbers of files. Previously, `DeleteRecords` would 
iterate the partition directory searching for producer state snapshot files. 
With this change, the iteration is removed in favor of keeping a 1:1 mapping 
between producer state snapshot file and segment file. A segment files 
corresponding producer state snapshot file is now deleted when the segment file 
is deleted. 
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> DeleteRecords may cause extreme lock contention for large partition 
> directories
> ---
>
> Key: KAFKA-9393
> URL: https://issues.apache.org/jira/browse/KAFKA-9393
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Lucas Bradstreet
>Priority: Major
>
> DeleteRecords, frequently used by KStreams triggers a 
> Log.maybeIncrementLogStartOffset call, calling 
> kafka.log.ProducerStateManager.listSnapshotFiles which calls 
> java.io.File.listFiles on the partition dir. The time taken to list this 
> directory can be extreme for partitions with many small segments (e.g 2) 
> taking multiple seconds to finish. This causes lock contention for the log, 
> and if produce requests are also occurring for the same log can cause a 
> majority of request handler threads to become blocked waiting for the 
> DeleteRecords call to finish.
> I believe this is a problem going back to the initial implementation of the 
> transactional producer, but I need to confirm how far back it goes.
> One possible solution is to maintain a producer state snapshot aligned to the 
> log segment, and simply delete it whenever we delete a segment. This would 
> ensure that we never have to perform a directory scan.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9398) Kafka Streams main thread may not exit even after close timeout has passed

2020-01-10 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013115#comment-17013115
 ] 

Matthias J. Sax commented on KAFKA-9398:


Other one: https://issues.apache.org/jira/browse/KAFKA-8178

> Kafka Streams main thread may not exit even after close timeout has passed
> --
>
> Key: KAFKA-9398
> URL: https://issues.apache.org/jira/browse/KAFKA-9398
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Bill Bejeck
>Assignee: Bill Bejeck
>Priority: Critical
> Fix For: 2.5.0
>
>
> Kafka Streams offers the KafkaStreams.close() method when shutting down a 
> Kafka Streams application. There are two overloads to this method, one that 
> takes no parameters and another taking a Duration specifying how long the 
> close() method should block waiting for streams shut down operations to 
> complete. The no-arg version of close() sets the timeout to Long.MAX_VALUE.
> The issue is that if a StreamThread is taking to long to complete or if one 
> of the Consumer or Producer clients is in a hung state, the Kafka Streams 
> application won't exit even after the specified timeout has expired.
> For example, consider this scenario:
>  # A sink topic gets deleted by accident 
>  # The user sets Producer max.block.ms config to a high value
> In this case, the Producer will issue a WARN logging statement and will 
> continue to make metadata requests looking for the expected topic. The 
> {{Producer}} will continue making metadata requests up until the max.block.ms 
> expires. If this value is high enough, calling close() with a timeout won't 
> fix the issue as when the timeout expires, the Kafka Streams application's 
> main thread won't exit.
> To prevent this type of issue, we should call Thread.interrupt() on all 
> StreamThread instances once the close() timeout has expired. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8178) KafkaProducer#send(ProducerRecord,Callback) may block for up to 60 seconds

2020-01-10 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013117#comment-17013117
 ] 

Matthias J. Sax commented on KAFKA-8178:


KAFKA-3450 is also related

> KafkaProducer#send(ProducerRecord,Callback) may block for up to 60 seconds
> --
>
> Key: KAFKA-8178
> URL: https://issues.apache.org/jira/browse/KAFKA-8178
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, producer 
>Reporter: Sergei Egorov
>Priority: Major
>
> Hello. I was running reactor-kafka with [the BlockHound 
> agent|https://github.com/reactor/BlockHound] (you can see the progress 
> [here|https://github.com/reactor/reactor-kafka/pull/75] and even run it 
> yourself) and it detected a very dangerous blocking call in 
> KafkaProducer#send(ProducerRecord,Callback) which is supposed to be async:
> {code:java}
> java.lang.Error: Blocking call! java.lang.Object#wait
>   at reactor.BlockHound$Builder.lambda$new$0(BlockHound.java:154)
>   at reactor.BlockHound$Builder.lambda$install$8(BlockHound.java:254)
>   at reactor.BlockHoundRuntime.checkBlocking(BlockHoundRuntime.java:43)
>   at java.lang.Object.wait(Object.java)
>   at org.apache.kafka.clients.Metadata.awaitUpdate(Metadata.java:181)
>   at 
> org.apache.kafka.clients.producer.KafkaProducer.waitOnMetadata(KafkaProducer.java:938)
>   at 
> org.apache.kafka.clients.producer.KafkaProducer.doSend(KafkaProducer.java:823)
>   at 
> org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:803)
> {code}
> it blocks for up to "maxBlockTimeMs" (60 seconds by default)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-6571) KafkaProducer.close(0) should be non-blocking

2020-01-10 Thread Matthias J. Sax (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax resolved KAFKA-6571.

Resolution: Duplicate

KIP-367 add `close(Duration)` with new non-blocking semantics: 
[https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=89070496]

Closing this.

> KafkaProducer.close(0) should be non-blocking
> -
>
> Key: KAFKA-6571
> URL: https://issues.apache.org/jira/browse/KAFKA-6571
> Project: Kafka
>  Issue Type: Bug
>Reporter: Dong Lin
>Assignee: Ahmed Al-Mehdi
>Priority: Major
>
> According to the Java doc of producer.close(long timeout, TimeUnit timeUnit), 
> it is said that "Specifying a timeout of zero means do not wait for pending 
> send requests to complete". However, producer.close(0) can currently block on 
> waiting for the sender thread to exit, which in turn can block on user's 
> callback.
> We probably should not let producer.close(0) join the sender thread if user 
> has specified zero timeout.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-3450) Producer blocks on send to topic that doesn't exist if auto create is disabled

2020-01-10 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-3450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013110#comment-17013110
 ] 

Matthias J. Sax commented on KAFKA-3450:


Seem KAFKA-3539 is a duplicate?

> Producer blocks on send to topic that doesn't exist if auto create is disabled
> --
>
> Key: KAFKA-3450
> URL: https://issues.apache.org/jira/browse/KAFKA-3450
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 0.9.0.1
>Reporter: Michal Turek
>Priority: Critical
>
> {{producer.send()}} is blocked for {{max.block.ms}} (default 60 seconds) if 
> the destination topic doesn't exist and if their automatic creation is 
> disabled. Warning from NetworkClient containing UNKNOWN_TOPIC_OR_PARTITION is 
> logged every 100 ms in a loop until the 60 seconds timeout expires, but the 
> operation is not recoverable.
> Preconditions
> - Kafka 0.9.0.1 with default configuration and auto.create.topics.enable=false
> - Kafka 0.9.0.1 clients.
> Example minimalist code
> https://github.com/avast/kafka-tests/blob/master/src/main/java/com/avast/kafkatests/othertests/nosuchtopic/NoSuchTopicTest.java
> {noformat}
> /**
>  * Test of sending to a topic that does not exist while automatic creation of 
> topics is disabled in Kafka (auto.create.topics.enable=false).
>  */
> public class NoSuchTopicTest {
> private static final Logger LOGGER = 
> LoggerFactory.getLogger(NoSuchTopicTest.class);
> public static void main(String[] args) {
> Properties properties = new Properties();
> properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, 
> "localhost:9092");
> properties.setProperty(ProducerConfig.CLIENT_ID_CONFIG, 
> NoSuchTopicTest.class.getSimpleName());
> properties.setProperty(ProducerConfig.MAX_BLOCK_MS_CONFIG, "1000"); 
> // Default is 60 seconds
> try (Producer producer = new 
> KafkaProducer<>(properties, new StringSerializer(), new StringSerializer())) {
> LOGGER.info("Sending message");
> producer.send(new ProducerRecord<>("ThisTopicDoesNotExist", 
> "key", "value"), (metadata, exception) -> {
> if (exception != null) {
> LOGGER.error("Send failed: {}", exception.toString());
> } else {
> LOGGER.info("Send successful: {}-{}/{}", 
> metadata.topic(), metadata.partition(), metadata.offset());
> }
> });
> LOGGER.info("Sending message");
> producer.send(new ProducerRecord<>("ThisTopicDoesNotExistToo", 
> "key", "value"), (metadata, exception) -> {
> if (exception != null) {
> LOGGER.error("Send failed: {}", exception.toString());
> } else {
> LOGGER.info("Send successful: {}-{}/{}", 
> metadata.topic(), metadata.partition(), metadata.offset());
> }
> });
> }
> }
> }
> {noformat}
> Related output
> {noformat}
> 2016-03-23 12:44:37.725 INFO  c.a.k.o.nosuchtopic.NoSuchTopicTest [main]: 
> Sending message (NoSuchTopicTest.java:26)
> 2016-03-23 12:44:37.830 WARN  o.a.kafka.clients.NetworkClient 
> [kafka-producer-network-thread | NoSuchTopicTest]: Error while fetching 
> metadata with correlation id 0 : 
> {ThisTopicDoesNotExist=UNKNOWN_TOPIC_OR_PARTITION} (NetworkClient.java:582)
> 2016-03-23 12:44:37.928 WARN  o.a.kafka.clients.NetworkClient 
> [kafka-producer-network-thread | NoSuchTopicTest]: Error while fetching 
> metadata with correlation id 1 : 
> {ThisTopicDoesNotExist=UNKNOWN_TOPIC_OR_PARTITION} (NetworkClient.java:582)
> 2016-03-23 12:44:38.028 WARN  o.a.kafka.clients.NetworkClient 
> [kafka-producer-network-thread | NoSuchTopicTest]: Error while fetching 
> metadata with correlation id 2 : 
> {ThisTopicDoesNotExist=UNKNOWN_TOPIC_OR_PARTITION} (NetworkClient.java:582)
> 2016-03-23 12:44:38.130 WARN  o.a.kafka.clients.NetworkClient 
> [kafka-producer-network-thread | NoSuchTopicTest]: Error while fetching 
> metadata with correlation id 3 : 
> {ThisTopicDoesNotExist=UNKNOWN_TOPIC_OR_PARTITION} (NetworkClient.java:582)
> 2016-03-23 12:44:38.231 WARN  o.a.kafka.clients.NetworkClient 
> [kafka-producer-network-thread | NoSuchTopicTest]: Error while fetching 
> metadata with correlation id 4 : 
> {ThisTopicDoesNotExist=UNKNOWN_TOPIC_OR_PARTITION} (NetworkClient.java:582)
> 2016-03-23 12:44:38.332 WARN  o.a.kafka.clients.NetworkClient 
> [kafka-producer-network-thread | NoSuchTopicTest]: Error while fetching 
> metadata with correlation id 5 : 
> {ThisTopicDoesNotExist=UNKNOWN_TOPIC_OR_PARTITION} (NetworkClient.java:582)
> 2016-03-23 12:44:38.433 WARN  o.a.kafka.clients.NetworkClient 
> [kafka-producer-network-thread | 

[jira] [Commented] (KAFKA-3539) KafkaProducer.send() may block even though it returns the Future

2020-01-10 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-3539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013111#comment-17013111
 ] 

Matthias J. Sax commented on KAFKA-3539:


Seems KAFKA-3450 is a duplicate?

> KafkaProducer.send() may block even though it returns the Future
> 
>
> Key: KAFKA-3539
> URL: https://issues.apache.org/jira/browse/KAFKA-3539
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Reporter: Oleg Zhurakousky
>Priority: Critical
>  Labels: needs-discussion, needs-kip
>
> You can get more details from the us...@kafka.apache.org by searching on the 
> thread with the subject "KafkaProducer block on send".
> The bottom line is that method that returns Future must never block, since it 
> essentially violates the Future contract as it was specifically designed to 
> return immediately passing control back to the user to check for completion, 
> cancel etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9398) Kafka Streams main thread may not exit even after close timeout has passed

2020-01-10 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013109#comment-17013109
 ] 

Matthias J. Sax commented on KAFKA-9398:


Other related issue: https://issues.apache.org/jira/browse/KAFKA-3450

> Kafka Streams main thread may not exit even after close timeout has passed
> --
>
> Key: KAFKA-9398
> URL: https://issues.apache.org/jira/browse/KAFKA-9398
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Bill Bejeck
>Assignee: Bill Bejeck
>Priority: Critical
> Fix For: 2.5.0
>
>
> Kafka Streams offers the KafkaStreams.close() method when shutting down a 
> Kafka Streams application. There are two overloads to this method, one that 
> takes no parameters and another taking a Duration specifying how long the 
> close() method should block waiting for streams shut down operations to 
> complete. The no-arg version of close() sets the timeout to Long.MAX_VALUE.
> The issue is that if a StreamThread is taking to long to complete or if one 
> of the Consumer or Producer clients is in a hung state, the Kafka Streams 
> application won't exit even after the specified timeout has expired.
> For example, consider this scenario:
>  # A sink topic gets deleted by accident 
>  # The user sets Producer max.block.ms config to a high value
> In this case, the Producer will issue a WARN logging statement and will 
> continue to make metadata requests looking for the expected topic. The 
> {{Producer}} will continue making metadata requests up until the max.block.ms 
> expires. If this value is high enough, calling close() with a timeout won't 
> fix the issue as when the timeout expires, the Kafka Streams application's 
> main thread won't exit.
> To prevent this type of issue, we should call Thread.interrupt() on all 
> StreamThread instances once the close() timeout has expired. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9398) Kafka Streams main thread may not exit even after close timeout has passed

2020-01-10 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013108#comment-17013108
 ] 

Matthias J. Sax commented on KAFKA-9398:


The root cause seems to be https://issues.apache.org/jira/browse/KAFKA-3539 – 
ie, because `send()` might block, `StreamsThread` is stuck and thus cannot call 
`producer.close()`.

> Kafka Streams main thread may not exit even after close timeout has passed
> --
>
> Key: KAFKA-9398
> URL: https://issues.apache.org/jira/browse/KAFKA-9398
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Bill Bejeck
>Assignee: Bill Bejeck
>Priority: Critical
> Fix For: 2.5.0
>
>
> Kafka Streams offers the KafkaStreams.close() method when shutting down a 
> Kafka Streams application. There are two overloads to this method, one that 
> takes no parameters and another taking a Duration specifying how long the 
> close() method should block waiting for streams shut down operations to 
> complete. The no-arg version of close() sets the timeout to Long.MAX_VALUE.
> The issue is that if a StreamThread is taking to long to complete or if one 
> of the Consumer or Producer clients is in a hung state, the Kafka Streams 
> application won't exit even after the specified timeout has expired.
> For example, consider this scenario:
>  # A sink topic gets deleted by accident 
>  # The user sets Producer max.block.ms config to a high value
> In this case, the Producer will issue a WARN logging statement and will 
> continue to make metadata requests looking for the expected topic. The 
> {{Producer}} will continue making metadata requests up until the max.block.ms 
> expires. If this value is high enough, calling close() with a timeout won't 
> fix the issue as when the timeout expires, the Kafka Streams application's 
> main thread won't exit.
> To prevent this type of issue, we should call Thread.interrupt() on all 
> StreamThread instances once the close() timeout has expired. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8764) LogCleanerManager endless loop while compacting/cleaning segments

2020-01-10 Thread Jun Rao (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013081#comment-17013081
 ] 

Jun Rao commented on KAFKA-8764:


[~trajakovic]: Thanks for providing the additional info. That makes sense now. 
Currently, the log cleaner runs independently on each replica. If a follower 
has been down for some time, it is possible that the leader has already cleaned 
the data and left holes in some log segments. When those segments get 
replicated to the follower, the follower will clean the same data again and 
potentially hit the above issue.

> LogCleanerManager endless loop while compacting/cleaning segments
> -
>
> Key: KAFKA-8764
> URL: https://issues.apache.org/jira/browse/KAFKA-8764
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Affects Versions: 2.3.0, 2.2.1, 2.4.0
> Environment: docker base image: openjdk:8-jre-alpine base image, 
> kafka from http://ftp.carnet.hr/misc/apache/kafka/2.2.1/kafka_2.12-2.2.1.tgz
>Reporter: Tomislav Rajakovic
>Priority: Major
>  Labels: patch
> Attachments: Screen Shot 2020-01-10 at 8.38.25 AM.png, 
> kafka2.4.0-KAFKA-8764.patch, kafka2.4.0-KAFKA-8764.patch, 
> log-cleaner-bug-reproduction.zip
>
>
> {{LogCleanerManager stuck in endless loop while clearing segments for one 
> partition resulting with many log outputs and heavy disk read/writes/IOPS.}}
>  
> Issue appeared on follower brokers, and it happens on every (new) broker if 
> partition assignment is changed.
>  
> Original issue setup:
>  * kafka_2.12-2.2.1 deployed as statefulset on kubernetes, 5 brokers
>  * log directory is (AWS) EBS mounted PV, gp2 (ssd) kind of 750GiB
>  * 5 zookeepers
>  * topic created with config:
>  ** name = "backup_br_domain_squad"
> partitions = 36
> replication_factor = 3
> config = {
>  "cleanup.policy" = "compact"
>  "min.compaction.lag.ms" = "8640"
>  "min.cleanable.dirty.ratio" = "0.3"
> }
>  
>  
> Log excerpt:
> {{[2019-08-07 12:10:53,895] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,895] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,031] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,173] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> 

[jira] [Commented] (KAFKA-8764) LogCleanerManager endless loop while compacting/cleaning segments

2020-01-10 Thread Tomislav Rajakovic (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013073#comment-17013073
 ] 

Tomislav Rajakovic commented on KAFKA-8764:
---

Hy [~junrao]. This issue happened on relatively busy kafka cluster, but I've 
never tinkered with kafka files, or  did any manual actions. And what's more 
indicative, this issue first time happened on follower brokers/replicas (ISRs), 
while master broker for topic-partition didn't "feel" same effect. Maybe origin 
of the problem was "first" ever cleanup on master broker, leaving holes, but 
that's just my guess.

 

And yes, I'm gonna make PR, just need to see "first good PR" and Contribution 
guideline.

 

 

> LogCleanerManager endless loop while compacting/cleaning segments
> -
>
> Key: KAFKA-8764
> URL: https://issues.apache.org/jira/browse/KAFKA-8764
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Affects Versions: 2.3.0, 2.2.1, 2.4.0
> Environment: docker base image: openjdk:8-jre-alpine base image, 
> kafka from http://ftp.carnet.hr/misc/apache/kafka/2.2.1/kafka_2.12-2.2.1.tgz
>Reporter: Tomislav Rajakovic
>Priority: Major
>  Labels: patch
> Attachments: Screen Shot 2020-01-10 at 8.38.25 AM.png, 
> kafka2.4.0-KAFKA-8764.patch, kafka2.4.0-KAFKA-8764.patch, 
> log-cleaner-bug-reproduction.zip
>
>
> {{LogCleanerManager stuck in endless loop while clearing segments for one 
> partition resulting with many log outputs and heavy disk read/writes/IOPS.}}
>  
> Issue appeared on follower brokers, and it happens on every (new) broker if 
> partition assignment is changed.
>  
> Original issue setup:
>  * kafka_2.12-2.2.1 deployed as statefulset on kubernetes, 5 brokers
>  * log directory is (AWS) EBS mounted PV, gp2 (ssd) kind of 750GiB
>  * 5 zookeepers
>  * topic created with config:
>  ** name = "backup_br_domain_squad"
> partitions = 36
> replication_factor = 3
> config = {
>  "cleanup.policy" = "compact"
>  "min.compaction.lag.ms" = "8640"
>  "min.cleanable.dirty.ratio" = "0.3"
> }
>  
>  
> Log excerpt:
> {{[2019-08-07 12:10:53,895] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,895] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,031] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,173] INFO 

[jira] [Commented] (KAFKA-8764) LogCleanerManager endless loop while compacting/cleaning segments

2020-01-10 Thread Jun Rao (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013058#comment-17013058
 ] 

Jun Rao commented on KAFKA-8764:


[~trajakovic]: Thanks for the investigation. Normally, the offset map is built 
from the dirty portion of the log that shouldn't contain holes in offsets. If 
segment [0,233) misses offset 232, it means that this segment has been cleaned 
and the dirty offset should have moved to 233 after the first round of 
cleaning. Was the dirty offset ever reset manually? In any case, I agree that 
it would be better to make the code more defensive. Could you submit the patch 
as a PR (details can be found in 
[https://cwiki.apache.org/confluence/display/KAFKA/Contributing+Code+Changes])?

> LogCleanerManager endless loop while compacting/cleaning segments
> -
>
> Key: KAFKA-8764
> URL: https://issues.apache.org/jira/browse/KAFKA-8764
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Affects Versions: 2.3.0, 2.2.1, 2.4.0
> Environment: docker base image: openjdk:8-jre-alpine base image, 
> kafka from http://ftp.carnet.hr/misc/apache/kafka/2.2.1/kafka_2.12-2.2.1.tgz
>Reporter: Tomislav Rajakovic
>Priority: Major
>  Labels: patch
> Attachments: Screen Shot 2020-01-10 at 8.38.25 AM.png, 
> kafka2.4.0-KAFKA-8764.patch, kafka2.4.0-KAFKA-8764.patch, 
> log-cleaner-bug-reproduction.zip
>
>
> {{LogCleanerManager stuck in endless loop while clearing segments for one 
> partition resulting with many log outputs and heavy disk read/writes/IOPS.}}
>  
> Issue appeared on follower brokers, and it happens on every (new) broker if 
> partition assignment is changed.
>  
> Original issue setup:
>  * kafka_2.12-2.2.1 deployed as statefulset on kubernetes, 5 brokers
>  * log directory is (AWS) EBS mounted PV, gp2 (ssd) kind of 750GiB
>  * 5 zookeepers
>  * topic created with config:
>  ** name = "backup_br_domain_squad"
> partitions = 36
> replication_factor = 3
> config = {
>  "cleanup.policy" = "compact"
>  "min.compaction.lag.ms" = "8640"
>  "min.cleanable.dirty.ratio" = "0.3"
> }
>  
>  
> Log excerpt:
> {{[2019-08-07 12:10:53,895] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,895] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,031] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  

[jira] [Commented] (KAFKA-8764) LogCleanerManager endless loop while compacting/cleaning segments

2020-01-10 Thread Jeff Nadler (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013052#comment-17013052
 ] 

Jeff Nadler commented on KAFKA-8764:


I just attached a graph of CPU usage in a small, 3-node cluster that shows this 
issue.   You can see that for all 3 nodes a massive increase in CPU usage when 
the log cleaner is going nuts - the high CPU periods correspond to tons of log 
entries in 'log-cleaner.log', and high CPU usage of the 'kafka-log-clean' 
thread.

> LogCleanerManager endless loop while compacting/cleaning segments
> -
>
> Key: KAFKA-8764
> URL: https://issues.apache.org/jira/browse/KAFKA-8764
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Affects Versions: 2.3.0, 2.2.1, 2.4.0
> Environment: docker base image: openjdk:8-jre-alpine base image, 
> kafka from http://ftp.carnet.hr/misc/apache/kafka/2.2.1/kafka_2.12-2.2.1.tgz
>Reporter: Tomislav Rajakovic
>Priority: Major
>  Labels: patch
> Attachments: Screen Shot 2020-01-10 at 8.38.25 AM.png, 
> kafka2.4.0-KAFKA-8764.patch, kafka2.4.0-KAFKA-8764.patch, 
> log-cleaner-bug-reproduction.zip
>
>
> {{LogCleanerManager stuck in endless loop while clearing segments for one 
> partition resulting with many log outputs and heavy disk read/writes/IOPS.}}
>  
> Issue appeared on follower brokers, and it happens on every (new) broker if 
> partition assignment is changed.
>  
> Original issue setup:
>  * kafka_2.12-2.2.1 deployed as statefulset on kubernetes, 5 brokers
>  * log directory is (AWS) EBS mounted PV, gp2 (ssd) kind of 750GiB
>  * 5 zookeepers
>  * topic created with config:
>  ** name = "backup_br_domain_squad"
> partitions = 36
> replication_factor = 3
> config = {
>  "cleanup.policy" = "compact"
>  "min.compaction.lag.ms" = "8640"
>  "min.cleanable.dirty.ratio" = "0.3"
> }
>  
>  
> Log excerpt:
> {{[2019-08-07 12:10:53,895] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,895] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,031] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,173] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,173] INFO Deleted log 
> 

[jira] [Updated] (KAFKA-8764) LogCleanerManager endless loop while compacting/cleaning segments

2020-01-10 Thread Jeff Nadler (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Nadler updated KAFKA-8764:
---
Attachment: Screen Shot 2020-01-10 at 8.38.25 AM.png

> LogCleanerManager endless loop while compacting/cleaning segments
> -
>
> Key: KAFKA-8764
> URL: https://issues.apache.org/jira/browse/KAFKA-8764
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Affects Versions: 2.3.0, 2.2.1, 2.4.0
> Environment: docker base image: openjdk:8-jre-alpine base image, 
> kafka from http://ftp.carnet.hr/misc/apache/kafka/2.2.1/kafka_2.12-2.2.1.tgz
>Reporter: Tomislav Rajakovic
>Priority: Major
>  Labels: patch
> Attachments: Screen Shot 2020-01-10 at 8.38.25 AM.png, 
> kafka2.4.0-KAFKA-8764.patch, kafka2.4.0-KAFKA-8764.patch, 
> log-cleaner-bug-reproduction.zip
>
>
> {{LogCleanerManager stuck in endless loop while clearing segments for one 
> partition resulting with many log outputs and heavy disk read/writes/IOPS.}}
>  
> Issue appeared on follower brokers, and it happens on every (new) broker if 
> partition assignment is changed.
>  
> Original issue setup:
>  * kafka_2.12-2.2.1 deployed as statefulset on kubernetes, 5 brokers
>  * log directory is (AWS) EBS mounted PV, gp2 (ssd) kind of 750GiB
>  * 5 zookeepers
>  * topic created with config:
>  ** name = "backup_br_domain_squad"
> partitions = 36
> replication_factor = 3
> config = {
>  "cleanup.policy" = "compact"
>  "min.compaction.lag.ms" = "8640"
>  "min.cleanable.dirty.ratio" = "0.3"
> }
>  
>  
> Log excerpt:
> {{[2019-08-07 12:10:53,895] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,895] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,031] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,173] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,173] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,173] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,173] INFO Deleted time index 
> 

[jira] [Commented] (KAFKA-9398) Kafka Streams main thread may not exit even after close timeout has passed

2020-01-10 Thread Bill Bejeck (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013044#comment-17013044
 ] 

Bill Bejeck commented on KAFKA-9398:


A PR is available for this ticket [https://github.com/apache/kafka/pull/7814]

> Kafka Streams main thread may not exit even after close timeout has passed
> --
>
> Key: KAFKA-9398
> URL: https://issues.apache.org/jira/browse/KAFKA-9398
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Bill Bejeck
>Assignee: Bill Bejeck
>Priority: Critical
> Fix For: 2.5.0
>
>
> Kafka Streams offers the KafkaStreams.close() method when shutting down a 
> Kafka Streams application. There are two overloads to this method, one that 
> takes no parameters and another taking a Duration specifying how long the 
> close() method should block waiting for streams shut down operations to 
> complete. The no-arg version of close() sets the timeout to Long.MAX_VALUE.
> The issue is that if a StreamThread is taking to long to complete or if one 
> of the Consumer or Producer clients is in a hung state, the Kafka Streams 
> application won't exit even after the specified timeout has expired.
> For example, consider this scenario:
>  # A sink topic gets deleted by accident 
>  # The user sets Producer max.block.ms config to a high value
> In this case, the Producer will issue a WARN logging statement and will 
> continue to make metadata requests looking for the expected topic. The 
> {{Producer}} will continue making metadata requests up until the max.block.ms 
> expires. If this value is high enough, calling close() with a timeout won't 
> fix the issue as when the timeout expires, the Kafka Streams application's 
> main thread won't exit.
> To prevent this type of issue, we should call Thread.interrupt() on all 
> StreamThread instances once the close() timeout has expired. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9398) Kafka Streams main thread may not exit even after close timeout has passed

2020-01-10 Thread Bill Bejeck (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bejeck updated KAFKA-9398:
---
Description: 
Kafka Streams offers the KafkaStreams.close() method when shutting down a Kafka 
Streams application. There are two overloads to this method, one that takes no 
parameters and another taking a Duration specifying how long the close() method 
should block waiting for streams shut down operations to complete. The no-arg 
version of close() sets the timeout to Long.MAX_VALUE.

The issue is that if a StreamThread is taking to long to complete or if one of 
the Consumer or Producer clients is in a hung state, the Kafka Streams 
application won't exit even after the specified timeout has expired.

For example, consider this scenario:
 # A sink topic gets deleted by accident 
 # The user sets Producer max.block.ms config to a high value

In this case, the Producer will issue a WARN logging statement and will 
continue to make metadata requests looking for the expected topic. The 
{{Producer}} will continue making metadata requests up until the max.block.ms 
expires. If this value is high enough, calling close() with a timeout won't fix 
the issue as when the timeout expires, the Kafka Streams application's main 
thread won't exit.

To prevent this type of issue, we should call Thread.interrupt() on all 
StreamThread instances once the close() timeout has expired. 

  was:
Kafka Streams offers the KafkaStreams.close() method when shutting down a Kafka 
Streams application. There are two overloads to this method, one that takes no 
parameters and another taking a Duration specifying how long the close() method 
should block waiting for streams shut down operations to complete. The no-arg 
version of close() sets the timeout to Long.MAX_VALUE.

 

The issue is that if a StreamThread is taking to long to complete or if one of 
the Consumer or Producer clients is in a hung state, the Kafka Streams 
application won't exit even after the specified timeout has expired.

For example, consider this scenario:
 # A sink topic gets deleted by accident 
 # The user sets Producer max.block.ms config to a high value

In this case, the Producer will issue a WARN logging statement and will 
continue to make metadata requests looking for the expected topic. The 
\{{Producer}} will continue making metadata requests up until the max.block.ms 
expires. If this value is high enough, calling close() with a timeout won't fix 
the issue as when the timeout expires, the Kafka Streams application's main 
thread won't exit.

To prevent this type of issue, we should call Thread.interrupt() on all 
StreamThread instances once the close() timeout has expired. 


> Kafka Streams main thread may not exit even after close timeout has passed
> --
>
> Key: KAFKA-9398
> URL: https://issues.apache.org/jira/browse/KAFKA-9398
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Bill Bejeck
>Assignee: Bill Bejeck
>Priority: Critical
> Fix For: 2.5.0
>
>
> Kafka Streams offers the KafkaStreams.close() method when shutting down a 
> Kafka Streams application. There are two overloads to this method, one that 
> takes no parameters and another taking a Duration specifying how long the 
> close() method should block waiting for streams shut down operations to 
> complete. The no-arg version of close() sets the timeout to Long.MAX_VALUE.
> The issue is that if a StreamThread is taking to long to complete or if one 
> of the Consumer or Producer clients is in a hung state, the Kafka Streams 
> application won't exit even after the specified timeout has expired.
> For example, consider this scenario:
>  # A sink topic gets deleted by accident 
>  # The user sets Producer max.block.ms config to a high value
> In this case, the Producer will issue a WARN logging statement and will 
> continue to make metadata requests looking for the expected topic. The 
> {{Producer}} will continue making metadata requests up until the max.block.ms 
> expires. If this value is high enough, calling close() with a timeout won't 
> fix the issue as when the timeout expires, the Kafka Streams application's 
> main thread won't exit.
> To prevent this type of issue, we should call Thread.interrupt() on all 
> StreamThread instances once the close() timeout has expired. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9398) Kafka Streams main thread may not exit even after close timeout has passed

2020-01-10 Thread Bill Bejeck (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bejeck updated KAFKA-9398:
---
Description: 
Kafka Streams offers the KafkaStreams.close() method when shutting down a Kafka 
Streams application. There are two overloads to this method, one that takes no 
parameters and another taking a Duration specifying how long the close() method 
should block waiting for streams shut down operations to complete. The no-arg 
version of close() sets the timeout to Long.MAX_VALUE.

 

The issue is that if a StreamThread is taking to long to complete or if one of 
the Consumer or Producer clients is in a hung state, the Kafka Streams 
application won't exit even after the specified timeout has expired.

For example, consider this scenario:
 # A sink topic gets deleted by accident 
 # The user sets Producer max.block.ms config to a high value

In this case, the Producer will issue a WARN logging statement and will 
continue to make metadata requests looking for the expected topic. The 
\{{Producer}} will continue making metadata requests up until the max.block.ms 
expires. If this value is high enough, calling close() with a timeout won't fix 
the issue as when the timeout expires, the Kafka Streams application's main 
thread won't exit.

To prevent this type of issue, we should call Thread.interrupt() on all 
StreamThread instances once the close() timeout has expired. 

  was:
Kafka Streams offers the {{KafkaStreams.close()}} method when shutting down a 
Kafka Streams application.  There are two overloads to this method, one that 
takes no parameters and another taking a {{Duration}} specifying how long the 
{{close()}} method should block waiting for streams shutdown operations to 
complete.  The no-arg version of {{close()}} sets the timeout to 
{{Long.MAX_VALUE}}.

 

The issue is that if a {{StreamThread}} is some how hung or if one of the 
{{Consumer}} or {{Producer}} clients are in a hung state, the Kafka Streams 
application won't exit even after the specified timeout has expired.

For example consider this scenario:
 # A sink topic gets deleted by accident 
 # The {{Producer max.block.ms}} config is set to high value

In this case the {{Producer}} will issue a {{WARN}} logging statement and will 
continue to make metadata requests looking for the expected topic.  This will 
continue up until the {{max.block.ms}} expires.  If this value is high enough, 
calling {{close()}} with a timeout won't fix the issue as when the timeout 
expires, the Kafka Streams application main thread won't exit.

To prevent this type of issue, we should call {{Thread.interrupt()}} on all 
{{StreamThread}} instances once the {{close()}} timeout has expired. 


> Kafka Streams main thread may not exit even after close timeout has passed
> --
>
> Key: KAFKA-9398
> URL: https://issues.apache.org/jira/browse/KAFKA-9398
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Bill Bejeck
>Assignee: Bill Bejeck
>Priority: Critical
> Fix For: 2.5.0
>
>
> Kafka Streams offers the KafkaStreams.close() method when shutting down a 
> Kafka Streams application. There are two overloads to this method, one that 
> takes no parameters and another taking a Duration specifying how long the 
> close() method should block waiting for streams shut down operations to 
> complete. The no-arg version of close() sets the timeout to Long.MAX_VALUE.
>  
> The issue is that if a StreamThread is taking to long to complete or if one 
> of the Consumer or Producer clients is in a hung state, the Kafka Streams 
> application won't exit even after the specified timeout has expired.
> For example, consider this scenario:
>  # A sink topic gets deleted by accident 
>  # The user sets Producer max.block.ms config to a high value
> In this case, the Producer will issue a WARN logging statement and will 
> continue to make metadata requests looking for the expected topic. The 
> \{{Producer}} will continue making metadata requests up until the 
> max.block.ms expires. If this value is high enough, calling close() with a 
> timeout won't fix the issue as when the timeout expires, the Kafka Streams 
> application's main thread won't exit.
> To prevent this type of issue, we should call Thread.interrupt() on all 
> StreamThread instances once the close() timeout has expired. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9398) Kafka Streams main thread may not exit even after close timeout has passed

2020-01-10 Thread Bill Bejeck (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bejeck updated KAFKA-9398:
---
Priority: Critical  (was: Major)

> Kafka Streams main thread may not exit even after close timeout has passed
> --
>
> Key: KAFKA-9398
> URL: https://issues.apache.org/jira/browse/KAFKA-9398
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Bill Bejeck
>Assignee: Bill Bejeck
>Priority: Critical
> Fix For: 2.5.0
>
>
> Kafka Streams offers the {{KafkaStreams.close()}} method when shutting down a 
> Kafka Streams application.  There are two overloads to this method, one that 
> takes no parameters and another taking a {{Duration}} specifying how long the 
> {{close()}} method should block waiting for streams shutdown operations to 
> complete.  The no-arg version of {{close()}} sets the timeout to 
> {{Long.MAX_VALUE}}.
>  
> The issue is that if a {{StreamThread}} is some how hung or if one of the 
> {{Consumer}} or {{Producer}} clients are in a hung state, the Kafka Streams 
> application won't exit even after the specified timeout has expired.
> For example consider this scenario:
>  # A sink topic gets deleted by accident 
>  # The {{Producer max.block.ms}} config is set to high value
> In this case the {{Producer}} will issue a {{WARN}} logging statement and 
> will continue to make metadata requests looking for the expected topic.  This 
> will continue up until the {{max.block.ms}} expires.  If this value is high 
> enough, calling {{close()}} with a timeout won't fix the issue as when the 
> timeout expires, the Kafka Streams application main thread won't exit.
> To prevent this type of issue, we should call {{Thread.interrupt()}} on all 
> {{StreamThread}} instances once the {{close()}} timeout has expired. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9398) Kafka Streams main thread may not exit even after close timeout has passed

2020-01-10 Thread Bill Bejeck (Jira)
Bill Bejeck created KAFKA-9398:
--

 Summary: Kafka Streams main thread may not exit even after close 
timeout has passed
 Key: KAFKA-9398
 URL: https://issues.apache.org/jira/browse/KAFKA-9398
 Project: Kafka
  Issue Type: Improvement
  Components: streams
Reporter: Bill Bejeck
Assignee: Bill Bejeck
 Fix For: 2.5.0


Kafka Streams offers the {{KafkaStreams.close()}} method when shutting down a 
Kafka Streams application.  There are two overloads to this method, one that 
takes no parameters and another taking a {{Duration}} specifying how long the 
{{close()}} method should block waiting for streams shutdown operations to 
complete.  The no-arg version of {{close()}} sets the timeout to 
{{Long.MAX_VALUE}}.

 

The issue is that if a {{StreamThread}} is some how hung or if one of the 
{{Consumer}} or {{Producer}} clients are in a hung state, the Kafka Streams 
application won't exit even after the specified timeout has expired.

For example consider this scenario:
 # A sink topic gets deleted by accident 
 # The {{Producer max.block.ms}} config is set to high value

In this case the {{Producer}} will issue a {{WARN}} logging statement and will 
continue to make metadata requests looking for the expected topic.  This will 
continue up until the {{max.block.ms}} expires.  If this value is high enough, 
calling {{close()}} with a timeout won't fix the issue as when the timeout 
expires, the Kafka Streams application main thread won't exit.

To prevent this type of issue, we should call {{Thread.interrupt()}} on all 
{{StreamThread}} instances once the {{close()}} timeout has expired. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9152) Improve Sensor Retrieval

2020-01-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013029#comment-17013029
 ] 

ASF GitHub Bot commented on KAFKA-9152:
---

highluck commented on pull request #7928: KAFKA-9152; Improve Sensor Retrieval
URL: https://github.com/apache/kafka/pull/7928
 
 
   @cadonna 
   I fixed it
   I'm sorry but can you confirm it once more?
   
   *More detailed description of your change,
   if necessary. The PR title and PR message become
   the squashed commit message, so use a separate
   comment to ping reviewers.*
   
   *Summary of testing strategy (including rationale)
   for the feature or bug fix. Unit and/or integration
   tests are expected for any behaviour change and
   system tests should be considered for larger changes.*
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Improve Sensor Retrieval 
> -
>
> Key: KAFKA-9152
> URL: https://issues.apache.org/jira/browse/KAFKA-9152
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Bruno Cadonna
>Assignee: highluck
>Priority: Minor
>  Labels: newbie, tech-debt
>
> This ticket shall improve two aspects of the retrieval of sensors:
> 1. Currently, when a sensor is retrieved with {{*Metrics.*Sensor()}} (e.g. 
> {{ThreadMetrics.createTaskSensor()}}) after it was created with the same 
> method {{*Metrics.*Sensor()}}, the sensor is added again to the corresponding 
> queue in {{*Sensors}} (e.g. {{threadLevelSensors}}) in 
> {{StreamsMetricsImpl}}. Those queues are used to remove the sensors when 
> {{removeAll*LevelSensors()}} is called. Having multiple times the same 
> sensors in this queue is not an issue from a correctness point of view. 
> However, it would reduce the footprint to only store a sensor once in those 
> queues.
> 2. When a sensor is retrieved, the current code attempts to create a new 
> sensor and to add to it again the corresponding metrics. This could be 
> avoided.
>  
> Both aspects could be improved by checking whether a sensor already exists by 
> calling {{getSensor()}} on the {{Metrics}} object and checking the return 
> value.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-9389) Document how to use kafka-reassign-partitions.sh to change log dirs for a partition

2020-01-10 Thread Mitchell (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mitchell reassigned KAFKA-9389:
---

Assignee: Mitchell

> Document how to use kafka-reassign-partitions.sh to change log dirs for a 
> partition
> ---
>
> Key: KAFKA-9389
> URL: https://issues.apache.org/jira/browse/KAFKA-9389
> Project: Kafka
>  Issue Type: Improvement
>Reporter: James Cheng
>Assignee: Mitchell
>Priority: Major
>  Labels: newbie
>
> KIP-113 introduced support for moving replicas between log directories. As 
> part of it, support was added to kafka-reassign-partitions.sh so that users 
> can move replicas between log directories. Specifically, when you call 
> "kafka-reassign-partitions.sh --topics-to-move-json-file 
> topics-to-move.json", you can specify a "log_dirs" key in the 
> topics-to-move.json file, and kafka-reassign-partitions.sh will then move 
> those replicas to those directories.
>  
> However, when working on that KIP, we didn't update the docs on 
> kafka.apache.org to describe how to use the new functionality. We should add 
> documentation on that.
>  
> I haven't used it before, but whoever works on this Jira can probably figure 
> it out by experimentation with kafka-reassign-partitions.sh, or by reading 
> KIP-113 page or the associated JIRAs.
>  * 
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories]
>  * KAFKA-5163
>  * KAFKA-5694
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9152) Improve Sensor Retrieval

2020-01-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012987#comment-17012987
 ] 

ASF GitHub Bot commented on KAFKA-9152:
---

highluck commented on pull request #7914: KAFKA-9152; Improve Sensor Retrieval
URL: https://github.com/apache/kafka/pull/7914
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Improve Sensor Retrieval 
> -
>
> Key: KAFKA-9152
> URL: https://issues.apache.org/jira/browse/KAFKA-9152
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Bruno Cadonna
>Assignee: highluck
>Priority: Minor
>  Labels: newbie, tech-debt
>
> This ticket shall improve two aspects of the retrieval of sensors:
> 1. Currently, when a sensor is retrieved with {{*Metrics.*Sensor()}} (e.g. 
> {{ThreadMetrics.createTaskSensor()}}) after it was created with the same 
> method {{*Metrics.*Sensor()}}, the sensor is added again to the corresponding 
> queue in {{*Sensors}} (e.g. {{threadLevelSensors}}) in 
> {{StreamsMetricsImpl}}. Those queues are used to remove the sensors when 
> {{removeAll*LevelSensors()}} is called. Having multiple times the same 
> sensors in this queue is not an issue from a correctness point of view. 
> However, it would reduce the footprint to only store a sensor once in those 
> queues.
> 2. When a sensor is retrieved, the current code attempts to create a new 
> sensor and to add to it again the corresponding metrics. This could be 
> avoided.
>  
> Both aspects could be improved by checking whether a sensor already exists by 
> calling {{getSensor()}} on the {{Metrics}} object and checking the return 
> value.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (KAFKA-8764) LogCleanerManager endless loop while compacting/cleaning segments

2020-01-10 Thread Tomislav Rajakovic (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomislav Rajakovic updated KAFKA-8764:
--
Comment: was deleted

(was: I think that this [^kafka2.4.0-KAFKA-8764.patch] resolves LogCleaner 
issue(s).)

> LogCleanerManager endless loop while compacting/cleaning segments
> -
>
> Key: KAFKA-8764
> URL: https://issues.apache.org/jira/browse/KAFKA-8764
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Affects Versions: 2.3.0, 2.2.1, 2.4.0
> Environment: docker base image: openjdk:8-jre-alpine base image, 
> kafka from http://ftp.carnet.hr/misc/apache/kafka/2.2.1/kafka_2.12-2.2.1.tgz
>Reporter: Tomislav Rajakovic
>Priority: Major
>  Labels: patch
> Attachments: kafka2.4.0-KAFKA-8764.patch, 
> kafka2.4.0-KAFKA-8764.patch, log-cleaner-bug-reproduction.zip
>
>
> {{LogCleanerManager stuck in endless loop while clearing segments for one 
> partition resulting with many log outputs and heavy disk read/writes/IOPS.}}
>  
> Issue appeared on follower brokers, and it happens on every (new) broker if 
> partition assignment is changed.
>  
> Original issue setup:
>  * kafka_2.12-2.2.1 deployed as statefulset on kubernetes, 5 brokers
>  * log directory is (AWS) EBS mounted PV, gp2 (ssd) kind of 750GiB
>  * 5 zookeepers
>  * topic created with config:
>  ** name = "backup_br_domain_squad"
> partitions = 36
> replication_factor = 3
> config = {
>  "cleanup.policy" = "compact"
>  "min.compaction.lag.ms" = "8640"
>  "min.cleanable.dirty.ratio" = "0.3"
> }
>  
>  
> Log excerpt:
> {{[2019-08-07 12:10:53,895] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,895] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,031] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,173] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,173] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,173] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,173] INFO 

[jira] [Commented] (KAFKA-8764) LogCleanerManager endless loop while compacting/cleaning segments

2020-01-10 Thread Tomislav Rajakovic (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012952#comment-17012952
 ] 

Tomislav Rajakovic commented on KAFKA-8764:
---

I think that this [^kafka2.4.0-KAFKA-8764.patch] resolves LogCleaner issue(s).

> LogCleanerManager endless loop while compacting/cleaning segments
> -
>
> Key: KAFKA-8764
> URL: https://issues.apache.org/jira/browse/KAFKA-8764
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Affects Versions: 2.3.0, 2.2.1, 2.4.0
> Environment: docker base image: openjdk:8-jre-alpine base image, 
> kafka from http://ftp.carnet.hr/misc/apache/kafka/2.2.1/kafka_2.12-2.2.1.tgz
>Reporter: Tomislav Rajakovic
>Priority: Major
> Attachments: kafka2.4.0-KAFKA-8764.patch, 
> log-cleaner-bug-reproduction.zip
>
>
> {{LogCleanerManager stuck in endless loop while clearing segments for one 
> partition resulting with many log outputs and heavy disk read/writes/IOPS.}}
>  
> Issue appeared on follower brokers, and it happens on every (new) broker if 
> partition assignment is changed.
>  
> Original issue setup:
>  * kafka_2.12-2.2.1 deployed as statefulset on kubernetes, 5 brokers
>  * log directory is (AWS) EBS mounted PV, gp2 (ssd) kind of 750GiB
>  * 5 zookeepers
>  * topic created with config:
>  ** name = "backup_br_domain_squad"
> partitions = 36
> replication_factor = 3
> config = {
>  "cleanup.policy" = "compact"
>  "min.compaction.lag.ms" = "8640"
>  "min.cleanable.dirty.ratio" = "0.3"
> }
>  
>  
> Log excerpt:
> {{[2019-08-07 12:10:53,895] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,895] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,031] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,173] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,173] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,173] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,173] INFO Deleted time index 
> 

[jira] [Updated] (KAFKA-8764) LogCleanerManager endless loop while compacting/cleaning segments

2020-01-10 Thread Tomislav Rajakovic (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomislav Rajakovic updated KAFKA-8764:
--
Attachment: kafka2.4.0-KAFKA-8764.patch

> LogCleanerManager endless loop while compacting/cleaning segments
> -
>
> Key: KAFKA-8764
> URL: https://issues.apache.org/jira/browse/KAFKA-8764
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Affects Versions: 2.3.0, 2.2.1, 2.4.0
> Environment: docker base image: openjdk:8-jre-alpine base image, 
> kafka from http://ftp.carnet.hr/misc/apache/kafka/2.2.1/kafka_2.12-2.2.1.tgz
>Reporter: Tomislav Rajakovic
>Priority: Major
> Attachments: kafka2.4.0-KAFKA-8764.patch, 
> log-cleaner-bug-reproduction.zip
>
>
> {{LogCleanerManager stuck in endless loop while clearing segments for one 
> partition resulting with many log outputs and heavy disk read/writes/IOPS.}}
>  
> Issue appeared on follower brokers, and it happens on every (new) broker if 
> partition assignment is changed.
>  
> Original issue setup:
>  * kafka_2.12-2.2.1 deployed as statefulset on kubernetes, 5 brokers
>  * log directory is (AWS) EBS mounted PV, gp2 (ssd) kind of 750GiB
>  * 5 zookeepers
>  * topic created with config:
>  ** name = "backup_br_domain_squad"
> partitions = 36
> replication_factor = 3
> config = {
>  "cleanup.policy" = "compact"
>  "min.compaction.lag.ms" = "8640"
>  "min.cleanable.dirty.ratio" = "0.3"
> }
>  
>  
> Log excerpt:
> {{[2019-08-07 12:10:53,895] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,895] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,031] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,173] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,173] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,173] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,173] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  

[jira] [Resolved] (KAFKA-9188) Flaky Test SslAdminClientIntegrationTest.testSynchronousAuthorizerAclUpdatesBlockRequestThreads

2020-01-10 Thread Rajini Sivaram (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajini Sivaram resolved KAFKA-9188.
---
Fix Version/s: 2.5.0
 Reviewer: Manikumar
   Resolution: Fixed

> Flaky Test 
> SslAdminClientIntegrationTest.testSynchronousAuthorizerAclUpdatesBlockRequestThreads
> ---
>
> Key: KAFKA-9188
> URL: https://issues.apache.org/jira/browse/KAFKA-9188
> Project: Kafka
>  Issue Type: Test
>  Components: core
>Reporter: Bill Bejeck
>Assignee: Rajini Sivaram
>Priority: Major
>  Labels: flaky-test
> Fix For: 2.5.0
>
>
> Failed in 
> [https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/9373/testReport/junit/kafka.api/SslAdminClientIntegrationTest/testSynchronousAuthorizerAclUpdatesBlockRequestThreads/]
>  
> {noformat}
> Error Messagejava.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.TimeoutException: Aborted due to 
> timeout.Stacktracejava.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.TimeoutException: Aborted due to timeout.
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)
>   at 
> kafka.api.SslAdminClientIntegrationTest.$anonfun$testSynchronousAuthorizerAclUpdatesBlockRequestThreads$1(SslAdminClientIntegrationTest.scala:201)
>   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> kafka.api.SslAdminClientIntegrationTest.testSynchronousAuthorizerAclUpdatesBlockRequestThreads(SslAdminClientIntegrationTest.scala:201)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: org.apache.kafka.common.errors.TimeoutException: Aborted due to 
> timeout.
> Standard Output[2019-11-14 15:13:51,489] ERROR [ReplicaFetcher replicaId=1, 
> leaderId=2, fetcherId=0] Error for partition mytopic1-1 at offset 0 
> (kafka.server.ReplicaFetcherThread:76)
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition.
> [2019-11-14 15:13:51,490] ERROR [ReplicaFetcher replicaId=1, leaderId=0, 
> fetcherId=0] Error for partition mytopic1-0 at offset 0 
> (kafka.server.ReplicaFetcherThread:76)
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition.
> [2019-11-14 15:14:04,686] ERROR [KafkaApi-2] Error when handling request: 
> clientId=adminclient-644, correlationId=4, api=CREATE_ACLS, version=1, 
> body={creations=[{resource_type=2,resource_name=foobar,resource_pattern_type=3,principal=User:ANONYMOUS,host=*,operation=3,permission_type=3},{resource_type=5,resource_name=transactional_id,resource_pattern_type=3,principal=User:ANONYMOUS,host=*,operation=4,permission_type=3}]}
>  (kafka.server.KafkaApis:76)
> org.apache.kafka.common.errors.ClusterAuthorizationException: Request 
> Request(processor=1, connectionId=127.0.0.1:41993-127.0.0.1:34770-0, 
> session=Session(User:ANONYMOUS,/127.0.0.1), listenerName=ListenerName(SSL), 
> 

[jira] [Commented] (KAFKA-9188) Flaky Test SslAdminClientIntegrationTest.testSynchronousAuthorizerAclUpdatesBlockRequestThreads

2020-01-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012742#comment-17012742
 ] 

ASF GitHub Bot commented on KAFKA-9188:
---

rajinisivaram commented on pull request #7918: KAFKA-9188; Fix flaky test 
SslAdminClientIntegrationTest.testSynchronousAuthorizerAclUpdatesBlockRequestThreads
URL: https://github.com/apache/kafka/pull/7918
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Flaky Test 
> SslAdminClientIntegrationTest.testSynchronousAuthorizerAclUpdatesBlockRequestThreads
> ---
>
> Key: KAFKA-9188
> URL: https://issues.apache.org/jira/browse/KAFKA-9188
> Project: Kafka
>  Issue Type: Test
>  Components: core
>Reporter: Bill Bejeck
>Assignee: Rajini Sivaram
>Priority: Major
>  Labels: flaky-test
>
> Failed in 
> [https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/9373/testReport/junit/kafka.api/SslAdminClientIntegrationTest/testSynchronousAuthorizerAclUpdatesBlockRequestThreads/]
>  
> {noformat}
> Error Messagejava.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.TimeoutException: Aborted due to 
> timeout.Stacktracejava.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.TimeoutException: Aborted due to timeout.
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)
>   at 
> kafka.api.SslAdminClientIntegrationTest.$anonfun$testSynchronousAuthorizerAclUpdatesBlockRequestThreads$1(SslAdminClientIntegrationTest.scala:201)
>   at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at 
> kafka.api.SslAdminClientIntegrationTest.testSynchronousAuthorizerAclUpdatesBlockRequestThreads(SslAdminClientIntegrationTest.scala:201)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:288)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:282)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: org.apache.kafka.common.errors.TimeoutException: Aborted due to 
> timeout.
> Standard Output[2019-11-14 15:13:51,489] ERROR [ReplicaFetcher replicaId=1, 
> leaderId=2, fetcherId=0] Error for partition mytopic1-1 at offset 0 
> (kafka.server.ReplicaFetcherThread:76)
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition.
> [2019-11-14 15:13:51,490] ERROR [ReplicaFetcher replicaId=1, leaderId=0, 
> fetcherId=0] Error for partition mytopic1-0 at offset 0 
> (kafka.server.ReplicaFetcherThread:76)
> org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server 
> does not host this topic-partition.
> [2019-11-14 15:14:04,686] ERROR [KafkaApi-2] Error when handling request: 
> clientId=adminclient-644, correlationId=4, api=CREATE_ACLS, version=1, 
> 

[jira] [Comment Edited] (KAFKA-8764) LogCleanerManager endless loop while compacting/cleaning segments

2020-01-10 Thread Tomislav Rajakovic (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012683#comment-17012683
 ] 

Tomislav Rajakovic edited comment on KAFKA-8764 at 1/10/20 10:24 AM:
-

[~jnadler] and [~seva.f] so it seems that they still didn't fix this issue, or 
it happens in rare condition(s). I know that in time when I was solving this 
issue, I've tested it with multiple versions (of 2.x.x) and all had same 
problem. Although I'm not so much fluent in Scala, I could probably compile 
latest version and give some hints about patching this issue out (got some 
clues back then when issue occurred).

 

Btw. voting for this issue might get faster help/attention ;).


was (Author: trajakovic):
[~jnadler] and [~seva.f] so it seems that they still didn't fix this issue, or 
it happens in rare condition(s). I know that in time when I was solving this 
issue, I've tested it with multiple versions (of 2.x.x) and all had same 
problem. Although I'm not so much fluent in Scala, I could probably compile 
latest version and give some hints about patching this issue out (got some 
clues back then when issue occurred).

> LogCleanerManager endless loop while compacting/cleaning segments
> -
>
> Key: KAFKA-8764
> URL: https://issues.apache.org/jira/browse/KAFKA-8764
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Affects Versions: 2.3.0, 2.2.1, 2.4.0
> Environment: docker base image: openjdk:8-jre-alpine base image, 
> kafka from http://ftp.carnet.hr/misc/apache/kafka/2.2.1/kafka_2.12-2.2.1.tgz
>Reporter: Tomislav Rajakovic
>Priority: Major
> Attachments: log-cleaner-bug-reproduction.zip
>
>
> {{LogCleanerManager stuck in endless loop while clearing segments for one 
> partition resulting with many log outputs and heavy disk read/writes/IOPS.}}
>  
> Issue appeared on follower brokers, and it happens on every (new) broker if 
> partition assignment is changed.
>  
> Original issue setup:
>  * kafka_2.12-2.2.1 deployed as statefulset on kubernetes, 5 brokers
>  * log directory is (AWS) EBS mounted PV, gp2 (ssd) kind of 750GiB
>  * 5 zookeepers
>  * topic created with config:
>  ** name = "backup_br_domain_squad"
> partitions = 36
> replication_factor = 3
> config = {
>  "cleanup.policy" = "compact"
>  "min.compaction.lag.ms" = "8640"
>  "min.cleanable.dirty.ratio" = "0.3"
> }
>  
>  
> Log excerpt:
> {{[2019-08-07 12:10:53,895] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,895] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,031] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted offset 

[jira] [Updated] (KAFKA-8764) LogCleanerManager endless loop while compacting/cleaning segments

2020-01-10 Thread Tomislav Rajakovic (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomislav Rajakovic updated KAFKA-8764:
--
Affects Version/s: 2.4.0

> LogCleanerManager endless loop while compacting/cleaning segments
> -
>
> Key: KAFKA-8764
> URL: https://issues.apache.org/jira/browse/KAFKA-8764
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Affects Versions: 2.3.0, 2.2.1, 2.4.0
> Environment: docker base image: openjdk:8-jre-alpine base image, 
> kafka from http://ftp.carnet.hr/misc/apache/kafka/2.2.1/kafka_2.12-2.2.1.tgz
>Reporter: Tomislav Rajakovic
>Priority: Major
> Attachments: log-cleaner-bug-reproduction.zip
>
>
> {{LogCleanerManager stuck in endless loop while clearing segments for one 
> partition resulting with many log outputs and heavy disk read/writes/IOPS.}}
>  
> Issue appeared on follower brokers, and it happens on every (new) broker if 
> partition assignment is changed.
>  
> Original issue setup:
>  * kafka_2.12-2.2.1 deployed as statefulset on kubernetes, 5 brokers
>  * log directory is (AWS) EBS mounted PV, gp2 (ssd) kind of 750GiB
>  * 5 zookeepers
>  * topic created with config:
>  ** name = "backup_br_domain_squad"
> partitions = 36
> replication_factor = 3
> config = {
>  "cleanup.policy" = "compact"
>  "min.compaction.lag.ms" = "8640"
>  "min.cleanable.dirty.ratio" = "0.3"
> }
>  
>  
> Log excerpt:
> {{[2019-08-07 12:10:53,895] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,895] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,031] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,173] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,173] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,173] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,173] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
>  
> And such log keeps 

[jira] [Commented] (KAFKA-8764) LogCleanerManager endless loop while compacting/cleaning segments

2020-01-10 Thread Tomislav Rajakovic (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012683#comment-17012683
 ] 

Tomislav Rajakovic commented on KAFKA-8764:
---

[~jnadler] and [~seva.f] so it seems that they still didn't fix this issue, or 
it happens in rare condition(s). I know that in time when I was solving this 
issue, I've tested it with multiple versions (of 2.x.x) and all had same 
problem. Although I'm not so much fluent in Scala, I could probably compile 
latest version and give some hints about patching this issue out (got some 
clues back then when issue occurred).

> LogCleanerManager endless loop while compacting/cleaning segments
> -
>
> Key: KAFKA-8764
> URL: https://issues.apache.org/jira/browse/KAFKA-8764
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Affects Versions: 2.3.0, 2.2.1
> Environment: docker base image: openjdk:8-jre-alpine base image, 
> kafka from http://ftp.carnet.hr/misc/apache/kafka/2.2.1/kafka_2.12-2.2.1.tgz
>Reporter: Tomislav Rajakovic
>Priority: Major
> Attachments: log-cleaner-bug-reproduction.zip
>
>
> {{LogCleanerManager stuck in endless loop while clearing segments for one 
> partition resulting with many log outputs and heavy disk read/writes/IOPS.}}
>  
> Issue appeared on follower brokers, and it happens on every (new) broker if 
> partition assignment is changed.
>  
> Original issue setup:
>  * kafka_2.12-2.2.1 deployed as statefulset on kubernetes, 5 brokers
>  * log directory is (AWS) EBS mounted PV, gp2 (ssd) kind of 750GiB
>  * 5 zookeepers
>  * topic created with config:
>  ** name = "backup_br_domain_squad"
> partitions = 36
> replication_factor = 3
> config = {
>  "cleanup.policy" = "compact"
>  "min.compaction.lag.ms" = "8640"
>  "min.cleanable.dirty.ratio" = "0.3"
> }
>  
>  
> Log excerpt:
> {{[2019-08-07 12:10:53,895] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,895] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,031] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,173] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,173] INFO Deleted log 
> 

[jira] [Commented] (KAFKA-2758) Improve Offset Commit Behavior

2020-01-10 Thread highluck (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012672#comment-17012672
 ] 

highluck commented on KAFKA-2758:
-

[~guozhang]

Is this issue still open

 

> Improve Offset Commit Behavior
> --
>
> Key: KAFKA-2758
> URL: https://issues.apache.org/jira/browse/KAFKA-2758
> Project: Kafka
>  Issue Type: Improvement
>  Components: consumer
>Reporter: Guozhang Wang
>Priority: Major
>  Labels: newbie, reliability
>
> There are two scenarios of offset committing that we can improve:
> 1) we can filter the partitions whose committed offset is equal to the 
> consumed offset, meaning there is no new consumed messages from this 
> partition and hence we do not need to include this partition in the commit 
> request.
> 2) we can make a commit request right after resetting to a fetch / consume 
> position either according to the reset policy (e.g. on consumer starting up, 
> or handling of out of range offset, etc), or through the {code} seek {code} 
> so that if the consumer fails right after these event, upon recovery it can 
> restarts from the reset position instead of resetting again: this can lead 
> to, for example, data loss if we use "largest" as reset policy while there 
> are new messages coming to the fetching partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KAFKA-2758) Improve Offset Commit Behavior

2020-01-10 Thread highluck (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012672#comment-17012672
 ] 

highluck edited comment on KAFKA-2758 at 1/10/20 10:16 AM:
---

[~guozhang]

Is this issue still open?

 


was (Author: high.lee):
[~guozhang]

Is this issue still open

 

> Improve Offset Commit Behavior
> --
>
> Key: KAFKA-2758
> URL: https://issues.apache.org/jira/browse/KAFKA-2758
> Project: Kafka
>  Issue Type: Improvement
>  Components: consumer
>Reporter: Guozhang Wang
>Priority: Major
>  Labels: newbie, reliability
>
> There are two scenarios of offset committing that we can improve:
> 1) we can filter the partitions whose committed offset is equal to the 
> consumed offset, meaning there is no new consumed messages from this 
> partition and hence we do not need to include this partition in the commit 
> request.
> 2) we can make a commit request right after resetting to a fetch / consume 
> position either according to the reset policy (e.g. on consumer starting up, 
> or handling of out of range offset, etc), or through the {code} seek {code} 
> so that if the consumer fails right after these event, upon recovery it can 
> restarts from the reset position instead of resetting again: this can lead 
> to, for example, data loss if we use "largest" as reset policy while there 
> are new messages coming to the fetching partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8764) LogCleanerManager endless loop while compacting/cleaning segments

2020-01-10 Thread Seva Feldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012667#comment-17012667
 ] 

Seva Feldman commented on KAFKA-8764:
-

Hi,

We have the exact same issue with __consumer_offset compacted topic which kills 
our consumer groups. Thanks, [~trajakovic], for the solution on manually update 
*cleaner-offset-checkpoint file.*

BR

> LogCleanerManager endless loop while compacting/cleaning segments
> -
>
> Key: KAFKA-8764
> URL: https://issues.apache.org/jira/browse/KAFKA-8764
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Affects Versions: 2.3.0, 2.2.1
> Environment: docker base image: openjdk:8-jre-alpine base image, 
> kafka from http://ftp.carnet.hr/misc/apache/kafka/2.2.1/kafka_2.12-2.2.1.tgz
>Reporter: Tomislav Rajakovic
>Priority: Major
> Attachments: log-cleaner-bug-reproduction.zip
>
>
> {{LogCleanerManager stuck in endless loop while clearing segments for one 
> partition resulting with many log outputs and heavy disk read/writes/IOPS.}}
>  
> Issue appeared on follower brokers, and it happens on every (new) broker if 
> partition assignment is changed.
>  
> Original issue setup:
>  * kafka_2.12-2.2.1 deployed as statefulset on kubernetes, 5 brokers
>  * log directory is (AWS) EBS mounted PV, gp2 (ssd) kind of 750GiB
>  * 5 zookeepers
>  * topic created with config:
>  ** name = "backup_br_domain_squad"
> partitions = 36
> replication_factor = 3
> config = {
>  "cleanup.policy" = "compact"
>  "min.compaction.lag.ms" = "8640"
>  "min.cleanable.dirty.ratio" = "0.3"
> }
>  
>  
> Log excerpt:
> {{[2019-08-07 12:10:53,895] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,895] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,031] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,173] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,173] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,173] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> 

[jira] [Updated] (KAFKA-7519) Transactional Ids Left in Pending State by TransactionStateManager During Transactional Id Expiration Are Unusable

2020-01-10 Thread Alper Kanat (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alper Kanat updated KAFKA-7519:
---
Attachment: image-2020-01-10-12-37-28-804.png

> Transactional Ids Left in Pending State by TransactionStateManager During 
> Transactional Id Expiration Are Unusable
> --
>
> Key: KAFKA-7519
> URL: https://issues.apache.org/jira/browse/KAFKA-7519
> Project: Kafka
>  Issue Type: Bug
>  Components: core, producer 
>Affects Versions: 2.0.0
>Reporter: Bridger Howell
>Assignee: Bridger Howell
>Priority: Blocker
> Fix For: 2.0.1, 2.1.0
>
> Attachments: KAFKA-7519.patch, image-2018-10-18-13-02-22-371.png, 
> image-2020-01-10-12-37-28-804.png
>
>
>  
> After digging into a case where an exactly-once streams process was bizarrely 
> unable to process incoming data, we observed the following:
>  * StreamThreads stalling while creating a producer, eventually resulting in 
> no consumption by that streams process. Looking into those threads, we found 
> they were stuck in a loop, sending InitProducerIdRequests and always 
> receiving back the retriable error CONCURRENT_TRANSACTIONS and trying again. 
> These requests always had the same transactional id.
>  * After changing the streams process to not use exactly-once, it was able to 
> process messages with no problems.
>  * Alternatively, changing the applicationId for that streams process, it was 
> able to process with no problems.
>  * Every hour,  every broker would fail the task `transactionalId-expiration` 
> with the following error:
>  ** 
> {code:java}
> {"exception":{"stacktrace":"java.lang.IllegalStateException: Preparing 
> transaction state transition to Dead while it already a pending sta
> te Dead
>     at 
> kafka.coordinator.transaction.TransactionMetadata.prepareTransitionTo(TransactionMetadata.scala:262)
>     at kafka.coordinator
> .transaction.TransactionMetadata.prepareDead(TransactionMetadata.scala:237)
>     at kafka.coordinator.transaction.TransactionStateManager$$a
> nonfun$enableTransactionalIdExpiration$1$$anonfun$apply$mcV$sp$1$$anonfun$2$$anonfun$apply$9$$anonfun$3.apply(TransactionStateManager.scal
> a:151)
>     at 
> kafka.coordinator.transaction.TransactionStateManager$$anonfun$enableTransactionalIdExpiration$1$$anonfun$apply$mcV$sp$1$$ano
> nfun$2$$anonfun$apply$9$$anonfun$3.apply(TransactionStateManager.scala:151)
>     at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
>     at
>  
> kafka.coordinator.transaction.TransactionMetadata.inLock(TransactionMetadata.scala:172)
>     at kafka.coordinator.transaction.TransactionSt
> ateManager$$anonfun$enableTransactionalIdExpiration$1$$anonfun$apply$mcV$sp$1$$anonfun$2$$anonfun$apply$9.apply(TransactionStateManager.sc
> ala:150)
>     at 
> kafka.coordinator.transaction.TransactionStateManager$$anonfun$enableTransactionalIdExpiration$1$$anonfun$apply$mcV$sp$1$$a
> nonfun$2$$anonfun$apply$9.apply(TransactionStateManager.scala:149)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(Traversable
> Like.scala:234)
>     at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>     at scala.collection.immutable.Li
> st.foreach(List.scala:392)
>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>     at scala.collection.immutable.Li
> st.map(List.scala:296)
>     at 
> kafka.coordinator.transaction.TransactionStateManager$$anonfun$enableTransactionalIdExpiration$1$$anonfun$app
> ly$mcV$sp$1$$anonfun$2.apply(TransactionStateManager.scala:149)
>     at kafka.coordinator.transaction.TransactionStateManager$$anonfun$enabl
> eTransactionalIdExpiration$1$$anonfun$apply$mcV$sp$1$$anonfun$2.apply(TransactionStateManager.scala:142)
>     at scala.collection.Traversabl
> eLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>     at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.
> scala:241)
>     at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
>     at scala.collection.mutable.HashMap$$anon
> fun$foreach$1.apply(HashMap.scala:130)
>     at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
>     at scala.collec
> tion.mutable.HashMap.foreachEntry(HashMap.scala:40)
>     at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
>     at scala.collecti
> on.TraversableLike$class.flatMap(TraversableLike.scala:241)
>     at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>     a
> t 
> kafka.coordinator.transaction.TransactionStateManager$$anonfun$enableTransactionalIdExpiration$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Tr
> ansactionStateManager.scala:142)
>     at 
> 

[jira] [Commented] (KAFKA-7519) Transactional Ids Left in Pending State by TransactionStateManager During Transactional Id Expiration Are Unusable

2020-01-10 Thread Alper Kanat (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012625#comment-17012625
 ] 

Alper Kanat commented on KAFKA-7519:


I just had the exact error in Kafka 2.2.0 – any ideas? is this PR really merged 
into 2.0.1, 2.1.0 releases?

!image-2020-01-10-12-37-28-804.png!

> Transactional Ids Left in Pending State by TransactionStateManager During 
> Transactional Id Expiration Are Unusable
> --
>
> Key: KAFKA-7519
> URL: https://issues.apache.org/jira/browse/KAFKA-7519
> Project: Kafka
>  Issue Type: Bug
>  Components: core, producer 
>Affects Versions: 2.0.0
>Reporter: Bridger Howell
>Assignee: Bridger Howell
>Priority: Blocker
> Fix For: 2.0.1, 2.1.0
>
> Attachments: KAFKA-7519.patch, image-2018-10-18-13-02-22-371.png, 
> image-2020-01-10-12-37-28-804.png
>
>
>  
> After digging into a case where an exactly-once streams process was bizarrely 
> unable to process incoming data, we observed the following:
>  * StreamThreads stalling while creating a producer, eventually resulting in 
> no consumption by that streams process. Looking into those threads, we found 
> they were stuck in a loop, sending InitProducerIdRequests and always 
> receiving back the retriable error CONCURRENT_TRANSACTIONS and trying again. 
> These requests always had the same transactional id.
>  * After changing the streams process to not use exactly-once, it was able to 
> process messages with no problems.
>  * Alternatively, changing the applicationId for that streams process, it was 
> able to process with no problems.
>  * Every hour,  every broker would fail the task `transactionalId-expiration` 
> with the following error:
>  ** 
> {code:java}
> {"exception":{"stacktrace":"java.lang.IllegalStateException: Preparing 
> transaction state transition to Dead while it already a pending sta
> te Dead
>     at 
> kafka.coordinator.transaction.TransactionMetadata.prepareTransitionTo(TransactionMetadata.scala:262)
>     at kafka.coordinator
> .transaction.TransactionMetadata.prepareDead(TransactionMetadata.scala:237)
>     at kafka.coordinator.transaction.TransactionStateManager$$a
> nonfun$enableTransactionalIdExpiration$1$$anonfun$apply$mcV$sp$1$$anonfun$2$$anonfun$apply$9$$anonfun$3.apply(TransactionStateManager.scal
> a:151)
>     at 
> kafka.coordinator.transaction.TransactionStateManager$$anonfun$enableTransactionalIdExpiration$1$$anonfun$apply$mcV$sp$1$$ano
> nfun$2$$anonfun$apply$9$$anonfun$3.apply(TransactionStateManager.scala:151)
>     at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
>     at
>  
> kafka.coordinator.transaction.TransactionMetadata.inLock(TransactionMetadata.scala:172)
>     at kafka.coordinator.transaction.TransactionSt
> ateManager$$anonfun$enableTransactionalIdExpiration$1$$anonfun$apply$mcV$sp$1$$anonfun$2$$anonfun$apply$9.apply(TransactionStateManager.sc
> ala:150)
>     at 
> kafka.coordinator.transaction.TransactionStateManager$$anonfun$enableTransactionalIdExpiration$1$$anonfun$apply$mcV$sp$1$$a
> nonfun$2$$anonfun$apply$9.apply(TransactionStateManager.scala:149)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(Traversable
> Like.scala:234)
>     at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>     at scala.collection.immutable.Li
> st.foreach(List.scala:392)
>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>     at scala.collection.immutable.Li
> st.map(List.scala:296)
>     at 
> kafka.coordinator.transaction.TransactionStateManager$$anonfun$enableTransactionalIdExpiration$1$$anonfun$app
> ly$mcV$sp$1$$anonfun$2.apply(TransactionStateManager.scala:149)
>     at kafka.coordinator.transaction.TransactionStateManager$$anonfun$enabl
> eTransactionalIdExpiration$1$$anonfun$apply$mcV$sp$1$$anonfun$2.apply(TransactionStateManager.scala:142)
>     at scala.collection.Traversabl
> eLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>     at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.
> scala:241)
>     at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
>     at scala.collection.mutable.HashMap$$anon
> fun$foreach$1.apply(HashMap.scala:130)
>     at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
>     at scala.collec
> tion.mutable.HashMap.foreachEntry(HashMap.scala:40)
>     at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
>     at scala.collecti
> on.TraversableLike$class.flatMap(TraversableLike.scala:241)
>     at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>     a
> t 
> 

[jira] [Commented] (KAFKA-9152) Improve Sensor Retrieval

2020-01-10 Thread highluck (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012572#comment-17012572
 ] 

highluck commented on KAFKA-9152:
-

[~cadonna] 
[https://github.com/apache/kafka/pull/7914]

I don't know if I understand

Please confirm pull

> Improve Sensor Retrieval 
> -
>
> Key: KAFKA-9152
> URL: https://issues.apache.org/jira/browse/KAFKA-9152
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Bruno Cadonna
>Assignee: highluck
>Priority: Minor
>  Labels: newbie, tech-debt
>
> This ticket shall improve two aspects of the retrieval of sensors:
> 1. Currently, when a sensor is retrieved with {{*Metrics.*Sensor()}} (e.g. 
> {{ThreadMetrics.createTaskSensor()}}) after it was created with the same 
> method {{*Metrics.*Sensor()}}, the sensor is added again to the corresponding 
> queue in {{*Sensors}} (e.g. {{threadLevelSensors}}) in 
> {{StreamsMetricsImpl}}. Those queues are used to remove the sensors when 
> {{removeAll*LevelSensors()}} is called. Having multiple times the same 
> sensors in this queue is not an issue from a correctness point of view. 
> However, it would reduce the footprint to only store a sensor once in those 
> queues.
> 2. When a sensor is retrieved, the current code attempts to create a new 
> sensor and to add to it again the corresponding metrics. This could be 
> avoided.
>  
> Both aspects could be improved by checking whether a sensor already exists by 
> calling {{getSensor()}} on the {{Metrics}} object and checking the return 
> value.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KAFKA-9152) Improve Sensor Retrieval

2020-01-10 Thread highluck (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012572#comment-17012572
 ] 

highluck edited comment on KAFKA-9152 at 1/10/20 8:40 AM:
--

[~cadonna]


 [https://github.com/apache/kafka/pull/7914]

I don't know if I understand

Please confirm pull


was (Author: high.lee):
[~cadonna] 
[https://github.com/apache/kafka/pull/7914]

I don't know if I understand

Please confirm pull

> Improve Sensor Retrieval 
> -
>
> Key: KAFKA-9152
> URL: https://issues.apache.org/jira/browse/KAFKA-9152
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Bruno Cadonna
>Assignee: highluck
>Priority: Minor
>  Labels: newbie, tech-debt
>
> This ticket shall improve two aspects of the retrieval of sensors:
> 1. Currently, when a sensor is retrieved with {{*Metrics.*Sensor()}} (e.g. 
> {{ThreadMetrics.createTaskSensor()}}) after it was created with the same 
> method {{*Metrics.*Sensor()}}, the sensor is added again to the corresponding 
> queue in {{*Sensors}} (e.g. {{threadLevelSensors}}) in 
> {{StreamsMetricsImpl}}. Those queues are used to remove the sensors when 
> {{removeAll*LevelSensors()}} is called. Having multiple times the same 
> sensors in this queue is not an issue from a correctness point of view. 
> However, it would reduce the footprint to only store a sensor once in those 
> queues.
> 2. When a sensor is retrieved, the current code attempts to create a new 
> sensor and to add to it again the corresponding metrics. This could be 
> avoided.
>  
> Both aspects could be improved by checking whether a sensor already exists by 
> calling {{getSensor()}} on the {{Metrics}} object and checking the return 
> value.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)