[jira] [Commented] (KAFKA-7484) Fix test SuppressionDurabilityIntegrationTest.shouldRecoverBufferAfterShutdown()

2018-10-05 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639350#comment-16639350
 ] 

Dong Lin commented on KAFKA-7484:
-

The test is newly added in [https://github.com/apache/kafka/pull/5724]

[~vvcephei] would you have time to take a look? Thanks!

> Fix test 
> SuppressionDurabilityIntegrationTest.shouldRecoverBufferAfterShutdown()
> 
>
> Key: KAFKA-7484
> URL: https://issues.apache.org/jira/browse/KAFKA-7484
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Dong Lin
>Priority: Major
>
> The test 
> SuppressionDurabilityIntegrationTest.shouldRecoverBufferAfterShutdown() fails 
> in the 2.1.0 branch Jekin job. See 
> [https://builds.apache.org/job/kafka-2.1-jdk8/1/testReport/junit/org.apache.kafka.streams.integration/SuppressionDurabilityIntegrationTest/shouldRecoverBufferAfterShutdown_1__eosEnabled_true_/.|https://builds.apache.org/job/kafka-2.1-jdk8/1/testReport/junit/org.apache.kafka.streams.integration/SuppressionDurabilityIntegrationTest/shouldRecoverBufferAfterShutdown_1__eosEnabled_true_/]
> Here is the stack trace: 
> java.lang.AssertionError: Condition not met within timeout 3. Did not 
> receive all 3 records from topic output-raw-shouldRecoverBufferAfterShutdown 
> at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:278) at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinRecordsReceived(IntegrationTestUtils.java:462)
>  at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinRecordsReceived(IntegrationTestUtils.java:343)
>  at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.verifyKeyValueTimestamps(IntegrationTestUtils.java:543)
>  at 
> org.apache.kafka.streams.integration.SuppressionDurabilityIntegrationTest.verifyOutput(SuppressionDurabilityIntegrationTest.java:239)
>  at 
> org.apache.kafka.streams.integration.SuppressionDurabilityIntegrationTest.shouldRecoverBufferAfterShutdown(SuppressionDurabilityIntegrationTest.java:206)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363) at 
> org.junit.runners.Suite.runChild(Suite.java:128) at 
> org.junit.runners.Suite.runChild(Suite.java:27) at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20) at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363) at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:106)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-7484) Fix test SuppressionDurabilityIntegrationTest.shouldRecoverBufferAfterShutdown()

2018-10-04 Thread Dong Lin (JIRA)
Dong Lin created KAFKA-7484:
---

 Summary: Fix test 
SuppressionDurabilityIntegrationTest.shouldRecoverBufferAfterShutdown()
 Key: KAFKA-7484
 URL: https://issues.apache.org/jira/browse/KAFKA-7484
 Project: Kafka
  Issue Type: Improvement
Reporter: Dong Lin


The test 
SuppressionDurabilityIntegrationTest.shouldRecoverBufferAfterShutdown() fails 
in the 2.1.0 branch Jekin job. See 
[https://builds.apache.org/job/kafka-2.1-jdk8/1/testReport/junit/org.apache.kafka.streams.integration/SuppressionDurabilityIntegrationTest/shouldRecoverBufferAfterShutdown_1__eosEnabled_true_/.|https://builds.apache.org/job/kafka-2.1-jdk8/1/testReport/junit/org.apache.kafka.streams.integration/SuppressionDurabilityIntegrationTest/shouldRecoverBufferAfterShutdown_1__eosEnabled_true_/]

Here is the stack trace: 

java.lang.AssertionError: Condition not met within timeout 3. Did not 
receive all 3 records from topic output-raw-shouldRecoverBufferAfterShutdown at 
org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:278) at 
org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinRecordsReceived(IntegrationTestUtils.java:462)
 at 
org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinRecordsReceived(IntegrationTestUtils.java:343)
 at 
org.apache.kafka.streams.integration.utils.IntegrationTestUtils.verifyKeyValueTimestamps(IntegrationTestUtils.java:543)
 at 
org.apache.kafka.streams.integration.SuppressionDurabilityIntegrationTest.verifyOutput(SuppressionDurabilityIntegrationTest.java:239)
 at 
org.apache.kafka.streams.integration.SuppressionDurabilityIntegrationTest.shouldRecoverBufferAfterShutdown(SuppressionDurabilityIntegrationTest.java:206)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498) at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
 at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
 at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
 at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
 at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
 at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
 at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at 
org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at 
org.junit.runners.ParentRunner.run(ParentRunner.java:363) at 
org.junit.runners.Suite.runChild(Suite.java:128) at 
org.junit.runners.Suite.runChild(Suite.java:27) at 
org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at 
org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at 
org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) at 
org.junit.rules.RunRules.evaluate(RunRules.java:20) at 
org.junit.runners.ParentRunner.run(ParentRunner.java:363) at 
org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:106)

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KAFKA-7441) Allow LogCleanerManager.resumeCleaning() to be used concurrently

2018-10-04 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin resolved KAFKA-7441.
-
Resolution: Fixed

> Allow LogCleanerManager.resumeCleaning() to be used concurrently
> 
>
> Key: KAFKA-7441
> URL: https://issues.apache.org/jira/browse/KAFKA-7441
> Project: Kafka
>  Issue Type: Improvement
>Reporter: xiongqi wu
>Assignee: xiongqi wu
>Priority: Blocker
> Fix For: 2.1.0
>
>
> LogCleanerManger provides APIs abortAndPauseCleaning(TopicPartition) and 
> resumeCleaning(Iterable[TopicPartition]). The abortAndPauseCleaning(...) will 
> do nothing if the partition is already in paused state. And 
> resumeCleaning(..) will always clear the state for the partition if the 
> partition is in paused state. Also, resumeCleaning(...) will throw 
> IllegalStateException if the partition does not have any state (e.g. its 
> state is cleared).
>  
> This will cause problem in the following scenario:
> 1) Background thread invokes LogManager.cleanupLogs() which in turn does  
> abortAndPauseCleaning(...) for a given partition. Now this partition is in 
> paused state.
> 2) User requests deletion for this partition. Controller sends 
> StopReplicaRequest with delete=true for this partition. RequestHanderThread 
> calls abortAndPauseCleaning(...) followed by resumeCleaning(...) for the same 
> partition. Now there is no state for this partition.
> 3) Background thread invokes resumeCleaning(...) as part of 
> LogManager.cleanupLogs(). Because there is no state for this partition, it 
> causes IllegalStateException.
>  
> This issue can also happen before KAFKA-7322 if unclean leader election 
> triggers log truncation for a partition at the same time that the partition 
> is deleted upon user request. But unclean leader election is very rare. The 
> fix made in https://issues.apache.org/jira/browse/KAFKA-7322 makes this issue 
> much more frequent.
> The solution is to record the number of pauses.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7196) Remove heartbeat delayed operation for those removed consumers at the end of each rebalance

2018-10-04 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7196:

Fix Version/s: (was: 2.2.0)
   2.1.0
   2.0.1
   1.1.2
   1.0.3

> Remove heartbeat delayed operation for those removed consumers at the end of 
> each rebalance
> ---
>
> Key: KAFKA-7196
> URL: https://issues.apache.org/jira/browse/KAFKA-7196
> Project: Kafka
>  Issue Type: Bug
>  Components: core, purgatory
>Reporter: Lincong Li
>Assignee: Lincong Li
>Priority: Minor
> Fix For: 1.0.3, 1.1.2, 2.0.1, 2.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> During the consumer group rebalance, when the joining group phase finishes, 
> the heartbeat delayed operation of the consumer that fails to rejoin the 
> group should be removed from the purgatory. Otherwise, even though the member 
> ID of the consumer has been removed from the group, its heartbeat delayed 
> operation is still registered in the purgatory and the heartbeat delayed 
> operation is going to timeout and then another unnecessary rebalance is 
> triggered because of it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KAFKA-7475) print the actual cluster bootstrap address on authentication failures

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin reassigned KAFKA-7475:
---

Assignee: radai rosenblatt

> print the actual cluster bootstrap address on authentication failures
> -
>
> Key: KAFKA-7475
> URL: https://issues.apache.org/jira/browse/KAFKA-7475
> Project: Kafka
>  Issue Type: Improvement
>Reporter: radai rosenblatt
>Assignee: radai rosenblatt
>Priority: Major
>
> currently when a kafka client fails to connect to a cluster, users see 
> something like this:
> {code}
> Connection to node -1 terminated during authentication. This may indicate 
> that authentication failed due to invalid credentials. 
> {code}
> that log line is mostly useless in identifying which (of potentially many) 
> kafka client is having issues and what kafka cluster is it having issues with.
> would be nice to record the remote host/port



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5276) Support derived and prefixed configs in DescribeConfigs (KIP-133)

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5276:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Support derived and prefixed configs in DescribeConfigs (KIP-133)
> -
>
> Key: KAFKA-5276
> URL: https://issues.apache.org/jira/browse/KAFKA-5276
> Project: Kafka
>  Issue Type: Sub-task
>  Components: config
>Reporter: Ismael Juma
>Priority: Major
> Fix For: 2.2.0
>
>
> The broker supports config overrides per listener. The way we do that is by 
> prefixing the configs with the listener name. These configs are not defined 
> by ConfigDef and they don't appear in `values()`. They do appear in 
> `originals()`. We should change the code so that we return these configs. 
> Because these configs are read-only, nothing needs to be done for 
> AlterConfigs.
> With regards to derived configs, an example is advertised.listeners, which 
> falls back to listeners. This is currently done outside AbstractConfig. We 
> should look into including these into AbstractConfig so that the fallback 
> happens for the returned configs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7149) Reduce assignment data size to improve kafka streams scalability

2018-10-03 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636555#comment-16636555
 ] 

Dong Lin commented on KAFKA-7149:
-

Added Navinder to contributor list and assigned Jira to Navinder.

> Reduce assignment data size to improve kafka streams scalability
> 
>
> Key: KAFKA-7149
> URL: https://issues.apache.org/jira/browse/KAFKA-7149
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 2.0.0
>Reporter: Ashish Surana
>Assignee: Navinder Brar
>Priority: Major
> Fix For: 2.1.0
>
>
> We observed that when we have high number of partitions, instances or 
> stream-threads, assignment-data size grows too fast and we start getting 
> below RecordTooLargeException at kafka-broker.
> Workaround of this issue is commented at: 
> https://issues.apache.org/jira/browse/KAFKA-6976
> Still it limits the scalability of kafka streams as moving around 100MBs of 
> assignment data for each rebalancing affects performance & reliability 
> (timeout exceptions starts appearing) as well. Also this limits kafka streams 
> scale even with high max.message.bytes setting as data size increases pretty 
> quickly with number of partitions, instances or stream-threads.
>  
> Solution:
> To address this issue in our cluster, we are sending the compressed 
> assignment-data. We saw assignment-data size reduced by 8X-10X. This improved 
> the kafka streams scalability drastically for us and we could now run it with 
> more than 8,000 partitions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KAFKA-7149) Reduce assignment data size to improve kafka streams scalability

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin reassigned KAFKA-7149:
---

Assignee: Navinder Brar  (was: Ashish Surana)

> Reduce assignment data size to improve kafka streams scalability
> 
>
> Key: KAFKA-7149
> URL: https://issues.apache.org/jira/browse/KAFKA-7149
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 2.0.0
>Reporter: Ashish Surana
>Assignee: Navinder Brar
>Priority: Major
> Fix For: 2.1.0
>
>
> We observed that when we have high number of partitions, instances or 
> stream-threads, assignment-data size grows too fast and we start getting 
> below RecordTooLargeException at kafka-broker.
> Workaround of this issue is commented at: 
> https://issues.apache.org/jira/browse/KAFKA-6976
> Still it limits the scalability of kafka streams as moving around 100MBs of 
> assignment data for each rebalancing affects performance & reliability 
> (timeout exceptions starts appearing) as well. Also this limits kafka streams 
> scale even with high max.message.bytes setting as data size increases pretty 
> quickly with number of partitions, instances or stream-threads.
>  
> Solution:
> To address this issue in our cluster, we are sending the compressed 
> assignment-data. We saw assignment-data size reduced by 8X-10X. This improved 
> the kafka streams scalability drastically for us and we could now run it with 
> more than 8,000 partitions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-4203) Java producer default max message size does not align with broker default

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-4203:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Java producer default max message size does not align with broker default
> -
>
> Key: KAFKA-4203
> URL: https://issues.apache.org/jira/browse/KAFKA-4203
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.1
>Reporter: Grant Henke
>Assignee: Ismael Juma
>Priority: Major
> Fix For: 2.2.0
>
>
> The Java producer sets max.request.size = 1048576 (the base 2 version of 1 MB 
> (MiB))
> The broker sets max.message.bytes = 112 (the base 10 value of 1 MB + 12 
> bytes for overhead)
> This means that by default the producer can try to produce messages larger 
> than the broker will accept resulting in RecordTooLargeExceptions.
> There were not similar issues in the old producer because it sets 
> max.message.size = 100 (the base 10 value of 1 MB)
> I propose we increase the broker default for max.message.bytes to 1048588 
> (the base 2 value of 1 MB (MiB) + 12 bytes for overhead) so that any message 
> produced with default configs from either producer does not result in a 
> RecordTooLargeException.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-3554) Generate actual data with specific compression ratio and add multi-thread support in the ProducerPerformance tool.

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-3554:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Generate actual data with specific compression ratio and add multi-thread 
> support in the ProducerPerformance tool.
> --
>
> Key: KAFKA-3554
> URL: https://issues.apache.org/jira/browse/KAFKA-3554
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 0.9.0.1
>Reporter: Jiangjie Qin
>Assignee: Jiangjie Qin
>Priority: Major
> Fix For: 2.2.0
>
>
> Currently the ProducerPerformance always generate the payload with same 
> bytes. This does not quite well to test the compressed data because the 
> payload is extremely compressible no matter how big the payload is.
> We can make some changes to make it more useful for compressed messages. 
> Currently I am generating the payload containing integer from a given range. 
> By adjusting the range of the integers, we can get different compression 
> ratios. 
> API wise, we can either let user to specify the integer range or the expected 
> compression ratio (we will do some probing to get the corresponding range for 
> the users)
> Besides that, in many cases, it is useful to have multiple producer threads 
> when the producer threads themselves are bottleneck. Admittedly people can 
> run multiple ProducerPerformance to achieve similar result, but it is still 
> different from the real case when people actually use the producer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-4249) Document how to customize GC logging options for broker

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-4249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-4249:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Document how to customize GC logging options for broker
> ---
>
> Key: KAFKA-4249
> URL: https://issues.apache.org/jira/browse/KAFKA-4249
> Project: Kafka
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 0.10.0.1
>Reporter: Jim Hoagland
>Assignee: Tom Bentley
>Priority: Major
> Fix For: 2.2.0
>
>
> We wanted to enable GC logging for Kafka broker and saw that you can set 
> GC_LOG_ENABLED=true.  However, this didn't do what we wanted.  For example, 
> the GC log will be overwritten every time the broker gets restarted.  It 
> wasn't clear how we could do that (no documentation of it that I can find), 
> so I did some research by looking at the source code and did some testing and 
> found that KAFKA_GC_LOG_OPTS could be set with alternate JVM options prior to 
> starting broker.  I posted my solution to StackOverflow:
>   
> http://stackoverflow.com/questions/39854424/how-to-enable-gc-logging-for-apache-kafka-brokers-while-preventing-log-file-ove
> (feel free to critique)
> That solution is now public, but it seems like the Kafka documentation should 
> say how to do this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-3042) updateIsr should stop after failed several times due to zkVersion issue

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-3042:

Fix Version/s: (was: 2.1.0)
   2.2.0

> updateIsr should stop after failed several times due to zkVersion issue
> ---
>
> Key: KAFKA-3042
> URL: https://issues.apache.org/jira/browse/KAFKA-3042
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.10.0.0
> Environment: jdk 1.7
> centos 6.4
>Reporter: Jiahongchao
>Assignee: Dong Lin
>Priority: Major
>  Labels: reliability
> Fix For: 2.2.0
>
> Attachments: controller.log, server.log.2016-03-23-01, 
> state-change.log
>
>
> sometimes one broker may repeatly log
> "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR"
> I think this is because the broker consider itself as the leader in fact it's 
> a follower.
> So after several failed tries, it need to find out who is the leader



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5792) Transient failure in KafkaAdminClientTest.testHandleTimeout

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5792:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Transient failure in KafkaAdminClientTest.testHandleTimeout
> ---
>
> Key: KAFKA-5792
> URL: https://issues.apache.org/jira/browse/KAFKA-5792
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Colin P. McCabe
>Priority: Major
>  Labels: transient-unit-test-failure
> Fix For: 2.2.0
>
>
> The {{KafkaAdminClientTest.testHandleTimeout}} test occasionally fails with 
> the following:
> {noformat}
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node 
> assignment.
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:213)
>   at 
> org.apache.kafka.clients.admin.KafkaAdminClientTest.testHandleTimeout(KafkaAdminClientTest.java:356)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting 
> for a node assignment.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-6774) Improve default groupId behavior in consumer

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-6774:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Improve default groupId behavior in consumer
> 
>
> Key: KAFKA-6774
> URL: https://issues.apache.org/jira/browse/KAFKA-6774
> Project: Kafka
>  Issue Type: Improvement
>  Components: consumer
>Reporter: Jason Gustafson
>Assignee: Vahid Hashemian
>Priority: Major
>  Labels: needs-kip
> Fix For: 2.2.0
>
>
> At the moment, the default groupId in the consumer is "". If you try to use 
> this to subscribe() to a topic, the broker will reject the group as invalid. 
> On the other hand, if you use it with assign(), then the user will be able to 
> fetch and commit offsets using the empty groupId. Probably 99% of the time, 
> this is not what the user expects. Instead you would probably expect that if 
> no groupId is provided, then no committed offsets will be fetched at all and 
> we'll just use the auto reset behavior if we don't have a current position.
> Here are two potential solutions (both requiring a KIP):
> 1. Change the default to null. We will preserve the current behavior for 
> subscribe(). When using assign(), we will not bother fetching committed 
> offsets for the null groupId, and any attempt to commit offsets will raise an 
> error. The user can still use the empty groupId, but they have to specify it 
> explicitly.
> 2. Keep the current default, but change the consumer to treat this value as 
> if it were null as described in option 1. The argument for this behavior is 
> that using the empty groupId to commit offsets is inherently a dangerous 
> practice and should not be permitted. We'd have to convince ourselves that 
> we're fine not needing to allow the empty groupId for backwards compatibility 
> though.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5722) Refactor ConfigCommand to use the AdminClient

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5722:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Refactor ConfigCommand to use the AdminClient
> -
>
> Key: KAFKA-5722
> URL: https://issues.apache.org/jira/browse/KAFKA-5722
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Viktor Somogyi
>Assignee: Viktor Somogyi
>Priority: Major
>  Labels: kip, needs-kip
> Fix For: 2.2.0
>
>
> The ConfigCommand currently uses a direct connection to zookeeper. The 
> zookeeper dependency should be deprecated and an AdminClient API created to 
> be used instead.
> This change requires a KIP.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5272) Improve validation for Alter Configs (KIP-133)

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5272:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Improve validation for Alter Configs (KIP-133)
> --
>
> Key: KAFKA-5272
> URL: https://issues.apache.org/jira/browse/KAFKA-5272
> Project: Kafka
>  Issue Type: Sub-task
>  Components: config
>Reporter: Ismael Juma
>Priority: Major
> Fix For: 2.2.0
>
>
> TopicConfigHandler.processConfigChanges() warns about certain topic configs. 
> We should include such validations in alter configs and reject the change if 
> the validation fails. Note that this should be done without changing the 
> behaviour of the ConfigCommand (as it does not have access to the broker 
> configs).
> We should consider adding other validations like KAFKA-4092 and KAFKA-4680.
> Finally, the AlterConfigsPolicy mentioned in KIP-133 will be added at the 
> same time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7026) Sticky assignor could assign a partition to multiple consumers (KIP-341)

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7026:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Sticky assignor could assign a partition to multiple consumers (KIP-341)
> 
>
> Key: KAFKA-7026
> URL: https://issues.apache.org/jira/browse/KAFKA-7026
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Vahid Hashemian
>Assignee: Vahid Hashemian
>Priority: Major
>  Labels: kip
> Fix For: 2.2.0
>
>
> In the following scenario sticky assignor assigns a topic partition to two 
> consumers in the group:
>  # Create a topic {{test}} with a single partition
>  # Start consumer {{c1}} in group {{sticky-group}} ({{c1}} becomes group 
> leader and gets {{test-0}})
>  # Start consumer {{c2}}  in group {{sticky-group}} ({{c1}} holds onto 
> {{test-0}}, {{c2}} does not get any partition) 
>  # Pause {{c1}} (e.g. using Java debugger) ({{c2}} becomes leader and takes 
> over {{test-0}}, {{c1}} leaves the group)
>  # Resume {{c1}}
> At this point both {{c1}} and {{c2}} will have {{test-0}} assigned to them.
>  
> The reason is {{c1}} still has kept its previous assignment ({{test-0}}) from 
> the last assignment it received from the leader (itself) and did not get the 
> next round of assignments (when {{c2}} became leader) because it was paused. 
> Both {{c1}} and {{c2}} enter the rebalance supplying {{test-0}} as their 
> existing assignment. The sticky assignor code does not currently check and 
> avoid this duplication.
>  
> Note: This issue was originally reported on 
> [StackOverflow|https://stackoverflow.com/questions/50761842/kafka-stickyassignor-breaking-delivery-to-single-consumer-in-the-group].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-3096) Leader is not set to -1 when it is shutdown if followers are down

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-3096:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Leader is not set to -1 when it is shutdown if followers are down
> -
>
> Key: KAFKA-3096
> URL: https://issues.apache.org/jira/browse/KAFKA-3096
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.9.0.0
>Reporter: Ismael Juma
>Assignee: Ismael Juma
>Priority: Major
>  Labels: reliability
> Fix For: 2.2.0
>
>
> Assuming a cluster with 2 brokers with unclear leader election disabled:
> 1. Start brokers 0 and 1
> 2. Perform partition assignment
> 3. Broker 0 is elected leader
> 4. Produce message and wait until metadata is propagated
> 6. Shutdown follower
> 7. Produce message
> 8. Shutdown leader
> 9. Start follower
> 10. Wait for leader election
> Expected: leader is -1
> Actual: leader is 0
> We have a test for this, but a bug in `waitUntilLeaderIsElectedOrChanged` 
> means that `newLeaderOpt` is not being checked.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5637) Document compatibility and release policies

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5637:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Document compatibility and release policies
> ---
>
> Key: KAFKA-5637
> URL: https://issues.apache.org/jira/browse/KAFKA-5637
> Project: Kafka
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Ismael Juma
>Assignee: Sönke Liebau
>Priority: Major
> Fix For: 2.2.0
>
>
> We should document our compatibility and release policies in one place so 
> that people have the correct expectations. This is generally important, but 
> more so now that we are releasing 1.0.0.
> I extracted the following topics from the mailing list thread as the ones 
> that should be documented as a minimum: 
> *Code stability*
> * Explanation of stability annotations and their implications
> * Explanation of what public apis are
> * *Discussion point: * Do we want to keep the _unstable_ annotation or is 
> _evolving_ sufficient going forward?
> *Support duration*
> * How long are versions supported?
> * How far are bugfixes backported?
> * How far are security fixes backported?
> * How long are protocol versions supported by subsequent code versions?
> * How long are older clients supported?
> * How long are older brokers supported?
> I will create an initial pull request to add a section to the documentation 
> as basis for further discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5029) cleanup javadocs and logging

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5029:

Fix Version/s: (was: 2.1.0)
   2.2.0

> cleanup javadocs and logging
> 
>
> Key: KAFKA-5029
> URL: https://issues.apache.org/jira/browse/KAFKA-5029
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Onur Karaman
>Assignee: Ismael Juma
>Priority: Major
> Fix For: 2.2.0
>
>
> Remove state change logger, splitting it up into the controller logs or 
> broker logs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-3866) KerberosLogin refresh time bug and other improvements

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-3866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-3866:

Fix Version/s: (was: 2.1.0)
   2.2.0

> KerberosLogin refresh time bug and other improvements
> -
>
> Key: KAFKA-3866
> URL: https://issues.apache.org/jira/browse/KAFKA-3866
> Project: Kafka
>  Issue Type: Bug
>  Components: security
>Affects Versions: 0.10.0.0
>Reporter: Ismael Juma
>Assignee: Ismael Juma
>Priority: Major
> Fix For: 2.2.0
>
>
> ZOOKEEPER-2295 describes a bug in the Kerberos refresh time logic that is 
> also present in our KerberosLogin class. While looking at the code, I found a 
> number of things that could be improved. More details in the PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5876) IQ should throw different exceptions for different errors

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5876:

Fix Version/s: (was: 2.1.0)
   2.2.0

> IQ should throw different exceptions for different errors
> -
>
> Key: KAFKA-5876
> URL: https://issues.apache.org/jira/browse/KAFKA-5876
> Project: Kafka
>  Issue Type: Task
>  Components: streams
>Reporter: Matthias J. Sax
>Assignee: Vito Jeng
>Priority: Major
>  Labels: needs-kip, newbie++
> Fix For: 2.2.0
>
>
> Currently, IQ does only throws {{InvalidStateStoreException}} for all errors 
> that occur. However, we have different types of errors and should throw 
> different exceptions for those types.
> For example, if a store was migrated it must be rediscovered while if a store 
> cannot be queried yet, because it is still re-created after a rebalance, the 
> user just needs to wait until store recreation is finished.
> There might be other examples, too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5736) Improve error message in Connect when all kafka brokers are down

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5736:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Improve error message in Connect when all kafka brokers are down
> 
>
> Key: KAFKA-5736
> URL: https://issues.apache.org/jira/browse/KAFKA-5736
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 0.11.0.0
>Reporter: Konstantine Karantasis
>Assignee: Konstantine Karantasis
>Priority: Major
> Fix For: 2.2.0
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Currently when all the Kafka brokers are down, Kafka Connect is failing with 
> a pretty unintuitive message when it tries to, for instance, reconfigure 
> tasks. 
> Example output: 
> {code:java}
> [2017-08-15 19:12:09,959] ERROR Failed to reconfigure connector's tasks, 
> retrying after backoff: 
> (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
> java.lang.IllegalArgumentException: CircularIterator can only be used on 
> non-empty lists
> at 
> org.apache.kafka.common.utils.CircularIterator.(CircularIterator.java:29)
> at 
> org.apache.kafka.clients.consumer.RoundRobinAssignor.assign(RoundRobinAssignor.java:61)
> at 
> org.apache.kafka.clients.consumer.internals.AbstractPartitionAssignor.assign(AbstractPartitionAssignor.java:68)
> at 
> ... (connector code)
> at 
> org.apache.kafka.connect.runtime.Worker.connectorTaskConfigs(Worker.java:230)
> {code}
> The error message needs to be improved, since its root cause is the absence 
> kafka brokers for assignment. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7462) Kafka brokers cannot provide OAuth without a token

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7462:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Kafka brokers cannot provide OAuth without a token
> --
>
> Key: KAFKA-7462
> URL: https://issues.apache.org/jira/browse/KAFKA-7462
> Project: Kafka
>  Issue Type: Bug
>  Components: security
>Affects Versions: 2.0.0
>Reporter: Rajini Sivaram
>Priority: Major
> Fix For: 2.2.0
>
>
> Like with all other SASL mechanisms, OAUTHBEARER uses the same LoginModule 
> class on both  server-side and the client-side. But unlike PLAIN or SCRAM 
> where client credentials are optional, OAUTHBEARER requires always requires a 
> token. So while with PLAIN/SCRAM, broker only needs to specify client 
> credentials if the mechanism is used for inter-broker communication, with 
> OAuth, broker requires client credentials even if OAuth is not used for 
> inter-broker communication. This is an issue with the default 
> `OAuthBearerUnsecuredLoginCallbackHandler` used on both client-side and 
> server-side. But more critically, it is an issue with 
> `OAuthBearerLoginModule` which doesn't commit if token == null (commit() 
> returns false).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7389) Upgrade spotBugs for Java 11 support

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7389:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Upgrade spotBugs for Java 11 support
> 
>
> Key: KAFKA-7389
> URL: https://issues.apache.org/jira/browse/KAFKA-7389
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Ismael Juma
>Assignee: Ismael Juma
>Priority: Major
> Fix For: 2.2.0
>
>
> KAFKA-5887 replaces findBugs with spotBugs adding support for Java 9 and 10. 
> However, Java 11 is not supported in spotbugs 3.1.5.
> Once this is fixed, we also need to update the build to enable spotBugs when 
> executed with Java 11 and we need to update the relevant Jenkins jobs to 
> execute spotBugs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KAFKA-7096) Consumer should drop the data for unassigned topic partitions

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin resolved KAFKA-7096.
-
Resolution: Fixed

> Consumer should drop the data for unassigned topic partitions
> -
>
> Key: KAFKA-7096
> URL: https://issues.apache.org/jira/browse/KAFKA-7096
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Mayuresh Gharat
>Assignee: Mayuresh Gharat
>Priority: Major
> Fix For: 2.2.0
>
>
> currently if a client has assigned topics : T1, T2, T3 and calls poll(), the 
> poll might fetch data for partitions for all 3 topics T1, T2, T3. Now if the 
> client unassigns some topics (for example T3) and calls poll() we still hold 
> the data (for T3) in the completedFetches queue until we actually reach the 
> buffered data for the unassigned Topics (T3 in our example) on subsequent 
> poll() calls, at which point we drop that data. This process of holding the 
> data is unnecessary.
> When a client creates a topic, it takes time for the broker to fetch ACLs for 
> the topic. But during this time, the client will issue fetchRequest for the 
> topic, it will get response for the partitions of this topic. The response 
> consist of TopicAuthorizationException for each of the partitions. This 
> response for each partition is wrapped with a completedFetch and added to the 
> completedFetches queue. Now when the client calls the next poll() it sees the 
> TopicAuthorizationException from the first buffered CompletedFetch. At this 
> point the client chooses to sleep for 1.5 min as a backoff (as per the 
> design), hoping that the Broker fetches the ACL from ACL store in the 
> meantime. Actually the Broker has already fetched the ACL by this time. When 
> the client calls poll() after the sleep, it again sees the 
> TopicAuthorizationException from the second completedFetch and it sleeps 
> again. So it takes (1.5 * 60 * partitions) seconds before the client can see 
> any data. With this patch, the client when it sees the first 
> TopicAuthorizationException, it can all assign(EmptySet), which will get rid 
> of the buffered completedFetches (those with TopicAuthorizationException) and 
> it can again call assign(TopicPartitions) before calling poll(). With this 
> patch we found that client was able to get the records as soon as the Broker 
> fetched the ACLs from ACL store.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7096) Consumer should drop the data for unassigned topic partitions

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7096:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Consumer should drop the data for unassigned topic partitions
> -
>
> Key: KAFKA-7096
> URL: https://issues.apache.org/jira/browse/KAFKA-7096
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Mayuresh Gharat
>Assignee: Mayuresh Gharat
>Priority: Major
> Fix For: 2.2.0
>
>
> currently if a client has assigned topics : T1, T2, T3 and calls poll(), the 
> poll might fetch data for partitions for all 3 topics T1, T2, T3. Now if the 
> client unassigns some topics (for example T3) and calls poll() we still hold 
> the data (for T3) in the completedFetches queue until we actually reach the 
> buffered data for the unassigned Topics (T3 in our example) on subsequent 
> poll() calls, at which point we drop that data. This process of holding the 
> data is unnecessary.
> When a client creates a topic, it takes time for the broker to fetch ACLs for 
> the topic. But during this time, the client will issue fetchRequest for the 
> topic, it will get response for the partitions of this topic. The response 
> consist of TopicAuthorizationException for each of the partitions. This 
> response for each partition is wrapped with a completedFetch and added to the 
> completedFetches queue. Now when the client calls the next poll() it sees the 
> TopicAuthorizationException from the first buffered CompletedFetch. At this 
> point the client chooses to sleep for 1.5 min as a backoff (as per the 
> design), hoping that the Broker fetches the ACL from ACL store in the 
> meantime. Actually the Broker has already fetched the ACL by this time. When 
> the client calls poll() after the sleep, it again sees the 
> TopicAuthorizationException from the second completedFetch and it sleeps 
> again. So it takes (1.5 * 60 * partitions) seconds before the client can see 
> any data. With this patch, the client when it sees the first 
> TopicAuthorizationException, it can all assign(EmptySet), which will get rid 
> of the buffered completedFetches (those with TopicAuthorizationException) and 
> it can again call assign(TopicPartitions) before calling poll(). With this 
> patch we found that client was able to get the records as soon as the Broker 
> fetched the ACLs from ACL store.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7307) Upgrade EasyMock for Java 11 support

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7307:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Upgrade EasyMock for Java 11 support
> 
>
> Key: KAFKA-7307
> URL: https://issues.apache.org/jira/browse/KAFKA-7307
> Project: Kafka
>  Issue Type: Sub-task
>  Components: packaging
>Reporter: Ismael Juma
>Assignee: Ismael Juma
>Priority: Major
> Fix For: 2.2.0
>
>
> A new version of EasyMock and ASM7_EXPERIMENTAL (or ASM7 when Java 11 ships) 
> enabled: https://github.com/easymock/easymock/issues/224
> EasyMocks shades its dependencies so they can't be upgraded independently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-6880) Zombie replicas must be fenced

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-6880:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Zombie replicas must be fenced
> --
>
> Key: KAFKA-6880
> URL: https://issues.apache.org/jira/browse/KAFKA-6880
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Major
>  Labels: needs-kip
> Fix For: 2.2.0
>
>
> Let's say we have three replicas for a partition: 1, 2 ,and 3.
> In epoch 0, broker 1 is the leader and writes up to offset 50. Broker 2 
> replicates up to offset 50, but broker 3 is a little behind at offset 40. The 
> high watermark is 40. 
> Suppose that broker 2 has a zk session expiration event, but fails to detect 
> it or fails to reestablish a session (e.g. due to a bug like KAFKA-6879), and 
> it continues fetching from broker 1.
> For whatever reason, broker 3 is elected the leader for epoch 1 beginning at 
> offset 40. Broker 1 detects the leader change and truncates its log to offset 
> 40. Some new data is appended up to offset 60, which is fully replicated to 
> broker 1. Broker 2 continues fetching from broker 1 at offset 50, but gets 
> NOT_LEADER_FOR_PARTITION errors, which is retriable and hence broker 2 will 
> retry.
> After some time, broker 1 becomes the leader again for epoch 2. Broker 1 
> begins at offset 60. Broker 2 has not exhausted retries and is now able to 
> fetch at offset 50 and append the last 10 records in order to catch up. 
> However, because it did not observed the leader changes, it never saw the 
> need to truncate its log. Hence offsets 40-49 still reflect the uncommitted 
> changes from epoch 0. Neither KIP-101 nor KIP-279 can fix this because the 
> tail of the log is correct.
> The basic problem is that zombie replicas are not fenced properly by the 
> leader epoch. We actually observed a sequence roughly like this after a 
> broker had partially deadlocked from KAFKA-6879. We should add the leader 
> epoch to fetch requests so that the leader can fence the zombie replicas.
> A related problem is that we currently allow such zombie replicas to be added 
> to the ISR even if they are in an offline state. The problem is that the 
> controller will never elect them, so being part of the ISR does not give the 
> availability guarantee that is intended. This would also be fixed by 
> verifying replica leader epoch in fetch requests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7383) Verify leader epoch in produce requests (KIP-359)

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7383:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Verify leader epoch in produce requests (KIP-359)
> -
>
> Key: KAFKA-7383
> URL: https://issues.apache.org/jira/browse/KAFKA-7383
> Project: Kafka
>  Issue Type: Improvement
>  Components: producer 
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Major
> Fix For: 2.2.0
>
>
> Implementation of 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-359%3A+Verify+leader+epoch+in+produce+requests.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-3821) Allow Kafka Connect source tasks to produce offset without writing to topics

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-3821:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Allow Kafka Connect source tasks to produce offset without writing to topics
> 
>
> Key: KAFKA-3821
> URL: https://issues.apache.org/jira/browse/KAFKA-3821
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Affects Versions: 0.9.0.1
>Reporter: Randall Hauch
>Priority: Major
>  Labels: needs-kip
> Fix For: 2.1.0
>
>
> Provide a way for a {{SourceTask}} implementation to record a new offset for 
> a given partition without necessarily writing a source record to a topic.
> Consider a connector task that uses the same offset when producing an unknown 
> number of {{SourceRecord}} objects (e.g., it is taking a snapshot of a 
> database). Once the task completes those records, the connector wants to 
> update the offsets (e.g., the snapshot is complete) but has no more records 
> to be written to a topic. With this change, the task could simply supply an 
> updated offset.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-3821) Allow Kafka Connect source tasks to produce offset without writing to topics

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-3821:

Fix Version/s: (was: 2.2.0)
   2.1.0

> Allow Kafka Connect source tasks to produce offset without writing to topics
> 
>
> Key: KAFKA-3821
> URL: https://issues.apache.org/jira/browse/KAFKA-3821
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Affects Versions: 0.9.0.1
>Reporter: Randall Hauch
>Priority: Major
>  Labels: needs-kip
> Fix For: 2.1.0
>
>
> Provide a way for a {{SourceTask}} implementation to record a new offset for 
> a given partition without necessarily writing a source record to a topic.
> Consider a connector task that uses the same offset when producing an unknown 
> number of {{SourceRecord}} objects (e.g., it is taking a snapshot of a 
> database). Once the task completes those records, the connector wants to 
> update the offsets (e.g., the snapshot is complete) but has no more records 
> to be written to a topic. With this change, the task could simply supply an 
> updated offset.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7251) Add support for TLS 1.3

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7251:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Add support for TLS 1.3
> ---
>
> Key: KAFKA-7251
> URL: https://issues.apache.org/jira/browse/KAFKA-7251
> Project: Kafka
>  Issue Type: New Feature
>  Components: security
>Reporter: Rajini Sivaram
>Priority: Major
> Fix For: 2.2.0
>
>
> Java 11 adds support for TLS 1.3. We should support this after we add support 
> for Java 11.
> Related issues:
> [https://bugs.openjdk.java.net/browse/JDK-8206170]
> [https://bugs.openjdk.java.net/browse/JDK-8206178]
> [https://bugs.openjdk.java.net/browse/JDK-8208538]
> [https://bugs.openjdk.java.net/browse/JDK-8207009]
> [https://bugs.openjdk.java.net/browse/JDK-8209893]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KAFKA-6045) All access to log should fail if log is closed

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin resolved KAFKA-6045.
-
Resolution: Won't Fix

> All access to log should fail if log is closed
> --
>
> Key: KAFKA-6045
> URL: https://issues.apache.org/jira/browse/KAFKA-6045
> Project: Kafka
>  Issue Type: Bug
>Reporter: Dong Lin
>Priority: Major
> Fix For: 2.1.0
>
>
> After log.close() or log.closeHandlers() is called for a given log, all uses 
> of the Log's API should fail with proper exception. For example, 
> log.appendAsLeader() should throw KafkaStorageException. APIs such as 
> Log.activeProducersWithLastSequence() should also fail but not necessarily 
> with KafkaStorageException, since the KafkaStorageException indicates failure 
> to access disk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7264) Support Java 11

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7264:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Support Java 11
> ---
>
> Key: KAFKA-7264
> URL: https://issues.apache.org/jira/browse/KAFKA-7264
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Ismael Juma
>Priority: Major
> Fix For: 2.2.0
>
>
> Java 11 is the next LTS release and it should be released by the end of 
> September. Kafka should ideally support that in the 2.1.0 release. A few 
> known issues/requirements have been captured via subtasks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-6029) Controller should wait for the leader migration to finish before ack a ControlledShutdownRequest

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-6029:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Controller should wait for the leader migration to finish before ack a 
> ControlledShutdownRequest
> 
>
> Key: KAFKA-6029
> URL: https://issues.apache.org/jira/browse/KAFKA-6029
> Project: Kafka
>  Issue Type: Sub-task
>  Components: controller, core
>Affects Versions: 1.0.0
>Reporter: Jiangjie Qin
>Priority: Major
> Fix For: 2.2.0
>
>
> In the controlled shutdown process, the controller will return the 
> ControlledShutdownResponse immediately after the state machine is updated. 
> Because the LeaderAndIsrRequests and UpdateMetadataRequests may not have been 
> successfully processed by the brokers, the leader migration and active ISR 
> shrink may not have done when the shutting down broker proceeds to shut down. 
> This will cause some of the leaders to take up to replica.lag.time.max.ms to 
> kick the broker out of ISR. Meanwhile the produce purgatory size will grow.
> Ideally, the controller should wait until all the LeaderAndIsrRequests and 
> UpdateMetadataRequests has been acked before sending back the 
> ControlledShutdownResponse.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5286) Producer should await transaction completion in close

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5286:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Producer should await transaction completion in close
> -
>
> Key: KAFKA-5286
> URL: https://issues.apache.org/jira/browse/KAFKA-5286
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Affects Versions: 0.11.0.0
>Reporter: Jason Gustafson
>Priority: Major
> Fix For: 2.2.0
>
>
> We should wait at least as long as the timeout for a transaction which has 
> begun completion (commit or abort) to be finished. Tricky thing is whether we 
> should abort a transaction which is in progress. It seems reasonable since 
> that's the coordinator will either timeout and abort the transaction or the 
> next producer using the same transactionalId will fence the producer and 
> abort the transaction. In any case, the transaction will be aborted, so 
> perhaps we should do it proactively.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-1120) Controller could miss a broker state change

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-1120:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Controller could miss a broker state change 
> 
>
> Key: KAFKA-1120
> URL: https://issues.apache.org/jira/browse/KAFKA-1120
> Project: Kafka
>  Issue Type: Sub-task
>  Components: core
>Affects Versions: 0.8.1
>Reporter: Jun Rao
>Assignee: Mickael Maison
>Priority: Major
>  Labels: reliability
> Fix For: 2.2.0
>
>
> When the controller is in the middle of processing a task (e.g., preferred 
> leader election, broker change), it holds a controller lock. During this 
> time, a broker could have de-registered and re-registered itself in ZK. After 
> the controller finishes processing the current task, it will start processing 
> the logic in the broker change listener. However, it will see no broker 
> change and therefore won't do anything to the restarted broker. This broker 
> will be in a weird state since the controller doesn't inform it to become the 
> leader of any partition. Yet, the cached metadata in other brokers could 
> still list that broker as the leader for some partitions. Client requests 
> routed to that broker will then get a TopicOrPartitionNotExistException. This 
> broker will continue to be in this bad state until it's restarted again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5543) We don't remove the LastStableOffsetLag metric when a partition is moved away from a broker

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5543:

Fix Version/s: (was: 2.1.0)
   2.2.0

> We don't remove the LastStableOffsetLag metric when a partition is moved away 
> from a broker
> ---
>
> Key: KAFKA-5543
> URL: https://issues.apache.org/jira/browse/KAFKA-5543
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.0
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Major
> Fix For: 2.2.0
>
>
> Reported by [~junrao], we have a small leak where the `LastStableOffsetLag` 
> metric is not removed along with the other metrics in the 
> `Partition.removeMetrics` method. This could create a leak when partitions 
> are reassigned or a topic is deleted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-5347) OutOfSequence error should be fatal

2018-10-03 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-5347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636544#comment-16636544
 ] 

Dong Lin commented on KAFKA-5347:
-

Moving this to 2.2.0 since there is no PR ready.

> OutOfSequence error should be fatal
> ---
>
> Key: KAFKA-5347
> URL: https://issues.apache.org/jira/browse/KAFKA-5347
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Jason Gustafson
>Assignee: Apurva Mehta
>Priority: Major
>  Labels: exactly-once
> Fix For: 2.2.0
>
>
> If the producer sees an OutOfSequence error for a given partition, we 
> currently treat it as an abortable error. This makes some sense because 
> OutOfSequence won't prevent us from being able to send the EndTxn to abort 
> the transaction. The problem is that the producer, even after aborting, still 
> won't be able to send to the topic with an OutOfSequence. One way to deal 
> with this is to ask the user to call {{initTransactions()}} again to bump the 
> epoch, but this is a bit difficult to explain and could be dangerous since it 
> renders zombie checking less effective. Probably we should just consider 
> OutOfSequence fatal for the transactional producer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5347) OutOfSequence error should be fatal

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5347:

Fix Version/s: (was: 2.1.0)
   2.2.0

> OutOfSequence error should be fatal
> ---
>
> Key: KAFKA-5347
> URL: https://issues.apache.org/jira/browse/KAFKA-5347
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Jason Gustafson
>Assignee: Apurva Mehta
>Priority: Major
>  Labels: exactly-once
> Fix For: 2.2.0
>
>
> If the producer sees an OutOfSequence error for a given partition, we 
> currently treat it as an abortable error. This makes some sense because 
> OutOfSequence won't prevent us from being able to send the EndTxn to abort 
> the transaction. The problem is that the producer, even after aborting, still 
> won't be able to send to the topic with an OutOfSequence. One way to deal 
> with this is to ask the user to call {{initTransactions()}} again to bump the 
> epoch, but this is a bit difficult to explain and could be dangerous since it 
> renders zombie checking less effective. Probably we should just consider 
> OutOfSequence fatal for the transactional producer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5527) Idempotent/transactional Producer part 2 (KIP-98)

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5527:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Idempotent/transactional Producer part 2 (KIP-98)
> -
>
> Key: KAFKA-5527
> URL: https://issues.apache.org/jira/browse/KAFKA-5527
> Project: Kafka
>  Issue Type: New Feature
>Reporter: Ismael Juma
>Priority: Major
> Fix For: 2.2.0
>
>
> KAFKA-4815 tracks the items that were included in 0.11.0.0. This JIRA is for 
> tracking the ones that did not make it. Setting "Fix version" to 0.11.1.0, 
> but that is subject to change.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5964) Add more unit tests for SslTransportLayer

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5964:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Add more unit tests for SslTransportLayer
> -
>
> Key: KAFKA-5964
> URL: https://issues.apache.org/jira/browse/KAFKA-5964
> Project: Kafka
>  Issue Type: Test
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
>Priority: Major
> Fix For: 2.2.0
>
>
> Add unit tests for the  edge cases updated in KAFKA-5920:
> 1. Test that handshake failures are propagated as SslAuthenticationException 
> even if there are I/O exceptions in any of the read/write operations
> 2. Test that received data is processed even after an I/O exception



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5782) Avoid unnecessary PID reset when expire batches.

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5782:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Avoid unnecessary PID reset when expire batches.
> 
>
> Key: KAFKA-5782
> URL: https://issues.apache.org/jira/browse/KAFKA-5782
> Project: Kafka
>  Issue Type: Improvement
>  Components: producer 
>Affects Versions: 0.11.0.0
>Reporter: Jiangjie Qin
>Priority: Major
> Fix For: 2.2.0
>
>
> This is more of an efficiency optimization. Currently we will reset PID when 
> batch expiration happens and one of the expired batches is in retry mode. 
> This is assuming that we don't know if the batch in retry has been appended 
> to the broker or not. However, if the batch was in retry due to a retriable 
> exception returned by the broker, the batch is not appended. In this case, we 
> do not need to reset the PID.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5834) AbstractConfig.logUnused() may log confusing warning information.

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5834:

Fix Version/s: (was: 2.1.0)
   2.2.0

> AbstractConfig.logUnused() may log confusing warning information.
> -
>
> Key: KAFKA-5834
> URL: https://issues.apache.org/jira/browse/KAFKA-5834
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Reporter: Jiangjie Qin
>Priority: Major
> Fix For: 2.2.0
>
>
> Currently {{AbstractConfig.logUnused()}} logs unused configurations in at 
> WARN level. It is a little weird because as long as there is a configurable 
> class taking a configuration, that configuration will be logged as unused at 
> WARN level even if it is actually used. It seems better to make it an INFO 
> level logging instead, or maybe it can take a log level argument to allow 
> caller to decide which log level should be used.
> [~hachikuji] [~ijuma] what do you think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KAFKA-5950) AdminClient should retry based on returned error codes

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin resolved KAFKA-5950.
-
Resolution: Fixed

Per comment in the PR, it appears that the issue has been addressed in 
KAFKA-6299 with PR [https://github.com/apache/kafka/pull/4295.] Please feel 
free to re-open if this is still an issue.

> AdminClient should retry based on returned error codes
> --
>
> Key: KAFKA-5950
> URL: https://issues.apache.org/jira/browse/KAFKA-5950
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ismael Juma
>Assignee: Andrey Dyachkov
>Priority: Major
>  Labels: needs-kip
> Fix For: 2.2.0
>
>
> The AdminClient only retries if the request fails with a retriable error. If 
> a response is returned, then a retry is never attempted. This is inconsistent 
> with other clients that check the error codes in the response and retry for 
> each retriable error code.
> We should consider if it makes sense to adopt this behaviour in the 
> AdminClient so that users don't have to do it themselves.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5950) AdminClient should retry based on returned error codes

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5950:

Fix Version/s: (was: 2.1.0)
   2.2.0

> AdminClient should retry based on returned error codes
> --
>
> Key: KAFKA-5950
> URL: https://issues.apache.org/jira/browse/KAFKA-5950
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ismael Juma
>Assignee: Andrey Dyachkov
>Priority: Major
>  Labels: needs-kip
> Fix For: 2.2.0
>
>
> The AdminClient only retries if the request fails with a retriable error. If 
> a response is returned, then a retry is never attempted. This is inconsistent 
> with other clients that check the error codes in the response and retry for 
> each retriable error code.
> We should consider if it makes sense to adopt this behaviour in the 
> AdminClient so that users don't have to do it themselves.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (KAFKA-5857) Excessive heap usage on controller node during reassignment

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin resolved KAFKA-5857.
-
Resolution: Cannot Reproduce

In general we need heapdump to investigate the issue. Since we don't have 
heapdump when the issue happened and there is significant improvement in the 
heap size used by controller in 1.1.0 per Ismael's comment, it seems reasonable 
to just close this JIRA. We can re-open it if there is still the same issue 
with Kafka 1.1.0 or later and we have heapdump.

> Excessive heap usage on controller node during reassignment
> ---
>
> Key: KAFKA-5857
> URL: https://issues.apache.org/jira/browse/KAFKA-5857
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.11.0.0
> Environment: CentOs 7, Java 1.8
>Reporter: Raoufeh Hashemian
>Priority: Major
>  Labels: reliability
> Fix For: 2.1.0
>
> Attachments: CPU.png, disk_write_x.png, memory.png, 
> reassignment_plan.txt
>
>
> I was trying to expand our kafka cluster of 6 broker nodes to 12 broker 
> nodes. 
> Before expansion, we had a single topic with 960 partitions and a replication 
> factor of 3. So each node had 480 partitions. The size of data in each node 
> was 3TB . 
> To do the expansion, I submitted a partition reassignment plan (see attached 
> file for the current/new assignments). The plan was optimized to minimize 
> data movement and be rack aware. 
> When I submitted the plan, it took approximately 3 hours for moving data from 
> old to new nodes to complete. After that, it started deleting source 
> partitions (I say this based on the number of file descriptors) and 
> rebalancing leaders which has not been successful. Meanwhile, the heap usage 
> in the controller node started to go up with a large slope (along with long 
> GC times) and it took 5 hours for the controller to go out of memory and 
> another controller started to have the same behaviour for another 4 hours. At 
> this time the zookeeper ran out of disk and the service stopped.
> To recover from this condition:
> 1) Removed zk logs to free up disk and restarted all 3 zk nodes
> 2) Deleted /kafka/admin/reassign_partitions node from zk
> 3) Had to do unclean restarts of kafka service on oom controller nodes which 
> took 3 hours to complete  . After this stage there was still 676 under 
> replicated partitions.
> 4) Do a clean restart on all 12 broker nodes.
> After step 4 , number of under replicated nodes went to 0.
> So I was wondering if this memory footprint from controller is expected for 
> 1k partitions ? Did we do sth wrong or it is a bug?
> Attached are some resource usage graph during this 30 hours event and the 
> reassignment plan. I'll try to add log files as well



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-6304) The controller should allow updating the partition reassignment for the partitions being reassigned

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-6304:

Fix Version/s: (was: 2.1.0)
   2.2.0

> The controller should allow updating the partition reassignment for the 
> partitions being reassigned
> ---
>
> Key: KAFKA-6304
> URL: https://issues.apache.org/jira/browse/KAFKA-6304
> Project: Kafka
>  Issue Type: Improvement
>  Components: controller, core
>Affects Versions: 1.0.0
>Reporter: Jiangjie Qin
>Priority: Major
> Fix For: 2.2.0
>
>
> Currently the controller will not process the partition reassignment of a 
> partition if the partition is already being reassigned.
> The issue is that if there is a broker failure during the partition 
> reassignment, the partition reassignment may never finish. And the users may 
> want to cancel the partition reassignment. However, the controller will 
> refuse to do that unless user manually deletes the partition reassignment zk 
> path, force a controller switch and then issue the revert command. This is 
> pretty involved. It seems reasonable for the controller to replace the 
> ongoing stuck reassignment and replace it with the updated partition 
> assignment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7198) Enhance KafkaStreams start method javadoc

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7198:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Enhance KafkaStreams start method javadoc
> -
>
> Key: KAFKA-7198
> URL: https://issues.apache.org/jira/browse/KAFKA-7198
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Bill Bejeck
>Priority: Major
>  Labels: newbie
> Fix For: 2.2.0
>
>
> The {{KafkaStreams.start}} method javadoc states that once called the streams 
> threads are started in the background hence the method does not block.  
> However you have GlobalKTables in your topology, the threads aren't started 
> until the GlobalKTables bootstrap fully so the javadoc for the {{start}} 
> method should be updated to reflect this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7051) Improve the efficiency of the ReplicaManager when there are many partitions

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7051:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Improve the efficiency of the ReplicaManager when there are many partitions
> ---
>
> Key: KAFKA-7051
> URL: https://issues.apache.org/jira/browse/KAFKA-7051
> Project: Kafka
>  Issue Type: Bug
>  Components: replication
>Affects Versions: 0.8.0
>Reporter: Colin P. McCabe
>Assignee: Colin P. McCabe
>Priority: Major
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5693) TopicCreationPolicy and AlterConfigsPolicy overlap

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5693:

Fix Version/s: (was: 2.1.0)
   2.2.0

> TopicCreationPolicy and AlterConfigsPolicy overlap
> --
>
> Key: KAFKA-5693
> URL: https://issues.apache.org/jira/browse/KAFKA-5693
> Project: Kafka
>  Issue Type: Bug
>Reporter: Tom Bentley
>Priority: Minor
>  Labels: kip
> Fix For: 2.2.0
>
>
> The administrator of a cluster can configure a {{CreateTopicPolicy}}, which 
> has access to the topic configs as well as other metadata to make its 
> decision about whether a topic creation is allowed. Thus in theory the 
> decision could be based on a combination of of the replication factor, and 
> the topic configs, for example. 
> Separately there is an AlterConfigPolicy, which only has access to the 
> configs (and can apply to configurable entities other than just topics).
> There are potential issues with this. For example although the 
> CreateTopicPolicy is checked at creation time, it's not checked for any later 
> alterations to the topic config. So policies which depend on both the topic 
> configs and other topic metadata could be worked around by changing the 
> configs after creation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5532) Making bootstrap.servers property a first citizen option for the ProducerPerformance

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5532:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Making bootstrap.servers property a first citizen option for the 
> ProducerPerformance
> 
>
> Key: KAFKA-5532
> URL: https://issues.apache.org/jira/browse/KAFKA-5532
> Project: Kafka
>  Issue Type: Improvement
>  Components: tools
>Reporter: Paolo Patierno
>Assignee: Paolo Patierno
>Priority: Trivial
> Fix For: 2.2.0
>
>
> Hi,
> using the ProducerPerformance tool you have to specify the bootstrap.servers 
> option using the producer-props or producer-config option. It could be better 
> having bootstrap.servers as a first citizen option like all the other tools, 
> so a dedicate --bootstrap-servers option.
> Thanks,
> Paolo.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-4125) Provide low-level Processor API meta data in DSL layer

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-4125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-4125:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Provide low-level Processor API meta data in DSL layer
> --
>
> Key: KAFKA-4125
> URL: https://issues.apache.org/jira/browse/KAFKA-4125
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Reporter: Matthias J. Sax
>Priority: Minor
>  Labels: kip
> Fix For: 2.2.0
>
>
> For Processor API, user can get meta data like record offset, timestamp etc 
> via the provided {{Context}} object. It might be useful to allow uses to 
> access this information in DSL layer, too.
> The idea would be, to do it "the Flink way", ie, by providing
> RichFunctions; {{mapValue()}} for example.
> Is takes a {{ValueMapper}} that only has method
> {noformat}
> V2 apply(V1 value);
> {noformat}
> Thus, you cannot get any meta data within apply (it's completely "blind").
> We would add two more interfaces: {{RichFunction}} with a method
> {{open(Context context)}} and
> {noformat}
> RichValueMapper extends ValueMapper, RichFunction
> {noformat}
> This way, the user can chose to implement Rich- or Standard-function and
> we do not need to change existing APIs. Both can be handed into
> {{KStream.mapValues()}} for example. Internally, we check if a Rich
> function is provided, and if yes, hand in the {{Context}} object once, to
> make it available to the user who can now access it within {{apply()}} -- or
> course, the user must set a member variable in {{open()}} to hold the
> reference to the Context object.
> KIP: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-159%3A+Introducing+Rich+functions+to+Streams



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-4693) Consumer subscription change during rebalance causes exception

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-4693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-4693:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Consumer subscription change during rebalance causes exception
> --
>
> Key: KAFKA-4693
> URL: https://issues.apache.org/jira/browse/KAFKA-4693
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Reporter: Jason Gustafson
>Priority: Minor
> Fix For: 2.2.0
>
>
> After every rebalance, the consumer validates that the assignment received 
> contains only partitions from topics that were subscribed. If not, then we 
> raise an exception to the user. It is possible for a wakeup or an interrupt 
> to leave the consumer with a rebalance in progress (e.g. with a JoinGroup to 
> the coordinator in-flight). If the user then changes the topic subscription, 
> then this validation upon completion of the rebalance will fail. We should 
> probably detect the subscription change, eat the exception, and request 
> another rebalance. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7266) Fix MetricsTest test flakiness

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7266:

Fix Version/s: (was: 2.0.1)
   (was: 2.1.0)
   2.2.0

> Fix MetricsTest test flakiness
> --
>
> Key: KAFKA-7266
> URL: https://issues.apache.org/jira/browse/KAFKA-7266
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Stanislav Kozlovski
>Assignee: Stanislav Kozlovski
>Priority: Minor
> Fix For: 2.2.0
>
>
> The test `kafka.api.MetricsTest.testMetrics` has been failing intermittently 
> in kafka builds (recent proof: 
> https://github.com/apache/kafka/pull/5436#issuecomment-409683955)
> The particular failure is in the `MessageConversionsTimeMs` metric assertion -
> {code}
> java.lang.AssertionError: Message conversion time not recorded 0.0
> {code}
> There has been work done previously 
> (https://github.com/apache/kafka/pull/4681) to combat the flakiness of the 
> test and while it has improved it, the test still fails sometimes.
> h3. Solution
> On my machine, the test failed 5 times out of 25 runs. Increasing the record 
> size and using compression should slow down message conversion enough to have 
> it be above 1ms. Locally this test has not failed in 200 runs and counting 
> with those changes



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7259) Remove deprecated ZKUtils usage from ZkSecurityMigrator

2018-10-03 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7259:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Remove deprecated ZKUtils usage from ZkSecurityMigrator
> ---
>
> Key: KAFKA-7259
> URL: https://issues.apache.org/jira/browse/KAFKA-7259
> Project: Kafka
>  Issue Type: Task
>  Components: core
>Reporter: Manikumar
>Assignee: Manikumar
>Priority: Minor
> Fix For: 2.2.0
>
>
> ZkSecurityMigrator code currently uses ZKUtils.  We can replace ZKUtils usage 
> with KafkaZkClient. Also remove usage of ZKUtils from various tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7196) Remove heartbeat delayed operation for those removed consumers at the end of each rebalance

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7196:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Remove heartbeat delayed operation for those removed consumers at the end of 
> each rebalance
> ---
>
> Key: KAFKA-7196
> URL: https://issues.apache.org/jira/browse/KAFKA-7196
> Project: Kafka
>  Issue Type: Bug
>  Components: core, purgatory
>Reporter: Lincong Li
>Assignee: Lincong Li
>Priority: Minor
> Fix For: 2.2.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> During the consumer group rebalance, when the joining group phase finishes, 
> the heartbeat delayed operation of the consumer that fails to rejoin the 
> group should be removed from the purgatory. Otherwise, even though the member 
> ID of the consumer has been removed from the group, its heartbeat delayed 
> operation is still registered in the purgatory and the heartbeat delayed 
> operation is going to timeout and then another unnecessary rebalance is 
> triggered because of it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-3596) Kafka Streams: Window expiration needs to consider more than event time

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-3596:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Kafka Streams: Window expiration needs to consider more than event time
> ---
>
> Key: KAFKA-3596
> URL: https://issues.apache.org/jira/browse/KAFKA-3596
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 0.10.0.0
>Reporter: Henry Cai
>Priority: Minor
>  Labels: architecture
> Fix For: 2.2.0
>
>
> Currently in Kafka Streams, the way the windows are expired in RocksDB is 
> triggered by new event insertion.  When a window is created at T0 with 10 
> minutes retention, when we saw a new record coming with event timestamp T0 + 
> 10 +1, we will expire that window (remove it) out of RocksDB.
> In the real world, it's very easy to see event coming with future timestamp 
> (or out-of-order events coming with big time gaps between events), this way 
> of retiring a window based on one event's event timestamp is dangerous.  I 
> think at least we need to consider both the event's event time and 
> server/stream time elapse.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-7377) update metrics module from yammer to dropwizrd

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-7377:

Fix Version/s: (was: 1.1.2)
   (was: 2.0.1)
   (was: 1.0.3)
   (was: 2.1.0)
   2.2.0

> update metrics module from yammer to dropwizrd
> --
>
> Key: KAFKA-7377
> URL: https://issues.apache.org/jira/browse/KAFKA-7377
> Project: Kafka
>  Issue Type: Improvement
>  Components: metrics
>Affects Versions: 1.0.0, 1.1.1, 2.0.0
>Reporter: RAJKUMAR NATARAJAN
>Priority: Minor
> Fix For: 2.2.0
>
>
> Current kafka metrics depends on yammers. Please see depencies below - 
> [https://mvnrepository.com/artifact/org.apache.kafka/kafka_2.12/2.0.0]
>  
> Yammer is outdated long time back. It would be good to update the metrics 
> module to dropwizard metrics.
>  
> [https://mvnrepository.com/artifact/com.codahale.metrics/metrics-core]
>  
> [https://mvnrepository.com/artifact/io.dropwizard.metrics/metrics-core]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5214) Re-add KafkaAdminClient#apiVersions

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5214:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Re-add KafkaAdminClient#apiVersions
> ---
>
> Key: KAFKA-5214
> URL: https://issues.apache.org/jira/browse/KAFKA-5214
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Colin P. McCabe
>Assignee: Colin P. McCabe
>Priority: Minor
> Fix For: 2.2.0
>
>
> We removed KafkaAdminClient#apiVersions just before 0.11.0.0 to give us a bit 
> more time to iterate on it before it's included in a release. We should add 
> the relevant methods back.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5692) Refactor PreferredReplicaLeaderElectionCommand to use AdminClient

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5692:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Refactor PreferredReplicaLeaderElectionCommand to use AdminClient
> -
>
> Key: KAFKA-5692
> URL: https://issues.apache.org/jira/browse/KAFKA-5692
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Tom Bentley
>Assignee: Tom Bentley
>Priority: Minor
>  Labels: kip, patch-available
> Fix For: 2.2.0
>
>
> The PreferredReplicaLeaderElectionCommand currently uses a direct connection 
> to zookeeper. The zookeeper dependency should be deprecated and an 
> AdminClient API created to be used instead. 
> This change will require a KIP.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-3575) Use console consumer access topic that does not exist, can not use "Control + C" to exit process

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-3575:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Use console consumer access topic that does not exist, can not use "Control + 
> C" to exit process
> 
>
> Key: KAFKA-3575
> URL: https://issues.apache.org/jira/browse/KAFKA-3575
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.9.0.0
> Environment: SUSE Linux Enterprise Server 11 SP3
>Reporter: NieWang
>Assignee: Tom Bentley
>Priority: Minor
>  Labels: patch-available
> Fix For: 2.2.0
>
>
> 1.  use "sh kafka-console-consumer.sh --zookeeper 10.252.23.133:2181 --topic 
> topic_02"  start console consumer. topic_02 does not exist.
> 2. you can not use "Control + C" to exit console consumer process. The 
> process is blocked.
> 3. use jstack check process stack, as follows:
> linux:~ # jstack 122967
> 2016-04-18 15:46:06
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.66-b17 mixed mode):
> "Attach Listener" #29 daemon prio=9 os_prio=0 tid=0x01781800 
> nid=0x1e0c8 waiting on condition [0x]
>java.lang.Thread.State: RUNNABLE
> "Thread-4" #27 prio=5 os_prio=0 tid=0x018a4000 nid=0x1e08a waiting on 
> condition [0x7ffbe5ac]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0xe00ed3b8> (a 
> java.util.concurrent.CountDownLatch$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
> at kafka.tools.ConsoleConsumer$$anon$1.run(ConsoleConsumer.scala:101)
> "SIGINT handler" #28 daemon prio=9 os_prio=0 tid=0x019d5800 
> nid=0x1e089 in Object.wait() [0x7ffbe5bc1000]
>java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.$$YJP$$wait(Native Method)
> at java.lang.Object.wait(Object.java)
> at java.lang.Thread.join(Thread.java:1245)
> - locked <0xe71fd4e8> (a kafka.tools.ConsoleConsumer$$anon$1)
> at java.lang.Thread.join(Thread.java:1319)
> at 
> java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:106)
> at 
> java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
> at java.lang.Shutdown.runHooks(Shutdown.java:123)
> at java.lang.Shutdown.sequence(Shutdown.java:167)
> at java.lang.Shutdown.exit(Shutdown.java:212)
> - locked <0xe00abfd8> (a java.lang.Class for 
> java.lang.Shutdown)
> at java.lang.Terminator$1.handle(Terminator.java:52)
> at sun.misc.Signal$1.run(Signal.java:212)
> at java.lang.Thread.run(Thread.java:745)
> "metrics-meter-tick-thread-2" #20 daemon prio=5 os_prio=0 
> tid=0x7ffbec77a800 nid=0x1e079 waiting on condition [0x7ffbe66c8000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0xe6fa6438> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1088)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
> at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> "metrics-meter-tick-thread-1" #19 daemon prio=5 os_prio=0 
> tid=0x7ffbec783000 nid=0x1e078 waiting on condition [0x7ffbe67c9000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0xe6fa6438> (a 
> java.util.concurrent.locks

[jira] [Updated] (KAFKA-3733) Avoid long command lines by setting CLASSPATH in environment

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-3733:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Avoid long command lines by setting CLASSPATH in environment
> 
>
> Key: KAFKA-3733
> URL: https://issues.apache.org/jira/browse/KAFKA-3733
> Project: Kafka
>  Issue Type: Improvement
>  Components: tools
>Reporter: Adrian Muraru
>Assignee: Adrian Muraru
>Priority: Minor
> Fix For: 2.2.0
>
>
> {{kafka-run-class.sh}} sets the JVM classpath in the command line via {{-cp}}.
> This generates long command lines that gets trimmed by the shell in commands 
> like ps, pgrep,etc.
> An alternative is to set the CLASSPATH in environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-3999) Consumer bytes-fetched metric uses decompressed message size (KIP-264)

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-3999:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Consumer bytes-fetched metric uses decompressed message size (KIP-264)
> --
>
> Key: KAFKA-3999
> URL: https://issues.apache.org/jira/browse/KAFKA-3999
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 0.9.0.1, 0.10.0.0
>Reporter: Jason Gustafson
>Assignee: Vahid Hashemian
>Priority: Minor
>  Labels: kip
> Fix For: 2.2.0
>
>
> It looks like the computation for the bytes-fetched metrics uses the size of 
> the decompressed message set. I would have expected it to be based off of the 
> raw size of the fetch responses. Perhaps it would be helpful to expose both 
> the raw and decompressed fetch sizes? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-4794) Add access to OffsetStorageReader from SourceConnector

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-4794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-4794:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Add access to OffsetStorageReader from SourceConnector
> --
>
> Key: KAFKA-4794
> URL: https://issues.apache.org/jira/browse/KAFKA-4794
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Affects Versions: 0.10.2.0
>Reporter: Florian Hussonnois
>Priority: Minor
>  Labels: needs-kip
> Fix For: 2.2.0
>
>
> Currently the offsets storage is only accessible from SourceTask to able to 
> initialize properly tasks after a restart, a crash or a reconfiguration 
> request.
> To implement more complex connectors that need to track the progression of 
> each task it would helpful to have access to an OffsetStorageReader instance 
> from the SourceConnector.
> In that way, we could have a background thread that could request a tasks 
> reconfiguration based on source offsets.
> This improvement proposal comes from a customer project that needs to 
> periodically scan directories on a shared storage for detecting and for 
> streaming new files into Kafka.
> The connector implementation is pretty straightforward.
> The connector uses a background thread to periodically scan directories. When 
> new inputs files are detected a tasks reconfiguration is requested. Then the 
> connector assigns a file subset to each task. 
> Each task stores sources offsets for the last sent record. The source offsets 
> data are:
>  - the size of file
>  - the bytes offset
>  - the bytes size 
> Tasks become idle when the assigned files are completed (in : 
> recordBytesOffsets + recordBytesSize = fileBytesSize).
> Then, the connector should be able to track offsets for each assigned file. 
> When all tasks has finished the connector can stop them or assigned new files 
> by requesting tasks reconfiguration. 
> Moreover, another advantage of monitoring source offsets from the connector 
> is detect slow or failed tasks and if necessary to be able to restart all 
> tasks.
> If you think this improvement is OK, I can work a pull request.
> Thanks,



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-4931) stop script fails due 4096 ps output limit

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-4931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-4931:

Fix Version/s: (was: 2.1.0)
   2.2.0

> stop script fails due 4096 ps output limit
> --
>
> Key: KAFKA-4931
> URL: https://issues.apache.org/jira/browse/KAFKA-4931
> Project: Kafka
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.10.2.0
>Reporter: Amit Jain
>Assignee: Tom Bentley
>Priority: Minor
>  Labels: patch-available
> Fix For: 2.2.0
>
>
> When run the script: bin/zookeeper-server-stop.sh fails to stop the zookeeper 
> server process if the ps output exceeds 4096 character limit of linux. I 
> think instead of ps we can use ${JAVA_HOME}/bin/jps -vl | grep QuorumPeerMain 
>  it would correctly stop zookeeper process. Currently we are using kill 
> PIDS=$(ps ax | grep java | grep -i QuorumPeerMain | grep -v grep | awk 
> '{print $1}')



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-4794) Add access to OffsetStorageReader from SourceConnector

2018-10-02 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-4794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636527#comment-16636527
 ] 

Dong Lin commented on KAFKA-4794:
-

Since there has been no activity on this for several months, moving this out to 
2.2.0

> Add access to OffsetStorageReader from SourceConnector
> --
>
> Key: KAFKA-4794
> URL: https://issues.apache.org/jira/browse/KAFKA-4794
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Affects Versions: 0.10.2.0
>Reporter: Florian Hussonnois
>Priority: Minor
>  Labels: needs-kip
> Fix For: 2.2.0
>
>
> Currently the offsets storage is only accessible from SourceTask to able to 
> initialize properly tasks after a restart, a crash or a reconfiguration 
> request.
> To implement more complex connectors that need to track the progression of 
> each task it would helpful to have access to an OffsetStorageReader instance 
> from the SourceConnector.
> In that way, we could have a background thread that could request a tasks 
> reconfiguration based on source offsets.
> This improvement proposal comes from a customer project that needs to 
> periodically scan directories on a shared storage for detecting and for 
> streaming new files into Kafka.
> The connector implementation is pretty straightforward.
> The connector uses a background thread to periodically scan directories. When 
> new inputs files are detected a tasks reconfiguration is requested. Then the 
> connector assigns a file subset to each task. 
> Each task stores sources offsets for the last sent record. The source offsets 
> data are:
>  - the size of file
>  - the bytes offset
>  - the bytes size 
> Tasks become idle when the assigned files are completed (in : 
> recordBytesOffsets + recordBytesSize = fileBytesSize).
> Then, the connector should be able to track offsets for each assigned file. 
> When all tasks has finished the connector can stop them or assigned new files 
> by requesting tasks reconfiguration. 
> Moreover, another advantage of monitoring source offsets from the connector 
> is detect slow or failed tasks and if necessary to be able to restart all 
> tasks.
> If you think this improvement is OK, I can work a pull request.
> Thanks,



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-4126) No relevant log when the topic is non-existent

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-4126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-4126:

Fix Version/s: (was: 2.1.0)
   2.2.0

> No relevant log when the topic is non-existent
> --
>
> Key: KAFKA-4126
> URL: https://issues.apache.org/jira/browse/KAFKA-4126
> Project: Kafka
>  Issue Type: Bug
>Reporter: Balázs Barnabás
>Assignee: Vahid Hashemian
>Priority: Minor
> Fix For: 2.2.0
>
>
> When a producer sends a ProducerRecord into a Kafka topic that doesn't 
> existst, there is no relevant debug/error log that points out the error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-5359) Exceptions from RequestFuture lack parts of the stack trace

2018-10-02 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-5359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636526#comment-16636526
 ] 

Dong Lin commented on KAFKA-5359:
-

Since there has been no activity on this for several months, moving this out to 
2.2.0

 

> Exceptions from RequestFuture lack parts of the stack trace
> ---
>
> Key: KAFKA-5359
> URL: https://issues.apache.org/jira/browse/KAFKA-5359
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Magnus Reftel
>Assignee: Vahid Hashemian
>Priority: Minor
> Fix For: 2.2.0
>
>
> When an exception occurs within a task that reports its result using a 
> RequestFuture, that exception is stored in a field on the RequestFuture using 
> the {{raise}} method. In many places in the code where such futures are 
> completed, that exception is then thrown directly using {{throw 
> future.exception();}} (see e.g. 
> [Fetcher.getTopicMetadata|https://github.com/apache/kafka/blob/aebba89a2b9b5ea6a7cab2599555232ef3fe21ad/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L316]).
> This means that the exception that ends up in client code only has stack 
> traces related to the original exception, but nothing leading up to the 
> completion of the future. The client therefore gets no indication of what was 
> going on in the client code - only that it somehow ended up in the Kafka 
> libraries, and that a task failed at some point.
> One solution to this is to use the exceptions from the future as causes for 
> chained exceptions, so that the client gets a stack trace that shows what the 
> client was doing, in addition to getting the stack traces for the exception 
> in the task.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5359) Exceptions from RequestFuture lack parts of the stack trace

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5359:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Exceptions from RequestFuture lack parts of the stack trace
> ---
>
> Key: KAFKA-5359
> URL: https://issues.apache.org/jira/browse/KAFKA-5359
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Reporter: Magnus Reftel
>Assignee: Vahid Hashemian
>Priority: Minor
> Fix For: 2.2.0
>
>
> When an exception occurs within a task that reports its result using a 
> RequestFuture, that exception is stored in a field on the RequestFuture using 
> the {{raise}} method. In many places in the code where such futures are 
> completed, that exception is then thrown directly using {{throw 
> future.exception();}} (see e.g. 
> [Fetcher.getTopicMetadata|https://github.com/apache/kafka/blob/aebba89a2b9b5ea6a7cab2599555232ef3fe21ad/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L316]).
> This means that the exception that ends up in client code only has stack 
> traces related to the original exception, but nothing leading up to the 
> completion of the future. The client therefore gets no indication of what was 
> going on in the client code - only that it somehow ended up in the Kafka 
> libraries, and that a task failed at some point.
> One solution to this is to use the exceptions from the future as causes for 
> chained exceptions, so that the client gets a stack trace that shows what the 
> client was doing, in addition to getting the stack traces for the exception 
> in the task.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-6448) Mx4jLoader#props.getBoolean("kafka_mx4jenable", false) conflict with the annotation

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-6448:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Mx4jLoader#props.getBoolean("kafka_mx4jenable", false) conflict with the 
> annotation
> ---
>
> Key: KAFKA-6448
> URL: https://issues.apache.org/jira/browse/KAFKA-6448
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 1.0.0
>Reporter: Hongyuan Li
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: KAFKA-6448-1.patch, KAFKA-6448-2.patch
>
>
> In the annotation, it said 
> {code}*This feature must be enabled with -Dmx4jenable=true*{code}
> *which is not compatible with the code* 
> {code}
> **
> props.getBoolean("kafka_mx4jenable", false)
>  **
> {code}
> patch KAFKA-6448-1.patch modifies the code, and KAFKA-6448-2.patch modifies 
> the annotation



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5517) Support linking to particular configuration parameters

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5517:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Support linking to particular configuration parameters
> --
>
> Key: KAFKA-5517
> URL: https://issues.apache.org/jira/browse/KAFKA-5517
> Project: Kafka
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Tom Bentley
>Assignee: Tom Bentley
>Priority: Minor
>  Labels: patch-available
> Fix For: 2.2.0
>
>
> Currently the configuration parameters are documented long tables, and it's 
> only possible to link to the heading before a particular table. When 
> discussing configuration parameters on forums it would be helpful to be able 
> to link to the particular parameter under discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5479) Docs for authorization omit authorizer.class.name

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5479:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Docs for authorization omit authorizer.class.name
> -
>
> Key: KAFKA-5479
> URL: https://issues.apache.org/jira/browse/KAFKA-5479
> Project: Kafka
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Tom Bentley
>Assignee: Tom Bentley
>Priority: Minor
>  Labels: patch-available
> Fix For: 2.2.0
>
>
> The documentation in §7.4 Authorization and ACLs doesn't mention the 
> {{authorizer.class.name}} setting. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-4893) async topic deletion conflicts with max topic length

2018-10-02 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636524#comment-16636524
 ] 

Dong Lin commented on KAFKA-4893:
-

Since there has been no activity on this for several weeks, moving this out to 
2.2.0

> async topic deletion conflicts with max topic length
> 
>
> Key: KAFKA-4893
> URL: https://issues.apache.org/jira/browse/KAFKA-4893
> Project: Kafka
>  Issue Type: Bug
>Reporter: Onur Karaman
>Assignee: Vahid Hashemian
>Priority: Minor
> Fix For: 2.2.0
>
>
> As per the 
> [documentation|http://kafka.apache.org/documentation/#basic_ops_add_topic], 
> topics can be only 249 characters long to line up with typical filesystem 
> limitations:
> {quote}
> Each sharded partition log is placed into its own folder under the Kafka log 
> directory. The name of such folders consists of the topic name, appended by a 
> dash (\-) and the partition id. Since a typical folder name can not be over 
> 255 characters long, there will be a limitation on the length of topic names. 
> We assume the number of partitions will not ever be above 100,000. Therefore, 
> topic names cannot be longer than 249 characters. This leaves just enough 
> room in the folder name for a dash and a potentially 5 digit long partition 
> id.
> {quote}
> {{kafka.common.Topic.maxNameLength}} is set to 249 and is used during 
> validation.
> This limit ends up not being quite right since topic deletion ends up 
> renaming the directory to the form {{topic-partition.uniqueId-delete}} as can 
> be seen in {{LogManager.asyncDelete}}:
> {code}
> val dirName = new StringBuilder(removedLog.name)
>   .append(".")
>   
> .append(java.util.UUID.randomUUID.toString.replaceAll("-",""))
>   .append(Log.DeleteDirSuffix)
>   .toString()
> {code}
> So the unique id and "-delete" suffix end up hogging some of the characters. 
> Deleting a long-named topic results in a log message such as the following:
> {code}
> kafka.common.KafkaStorageException: Failed to rename log directory from 
> /tmp/kafka-logs0/0-0
>  to 
> /tmp/kafka-logs0/0-0.797bba3fb2464729840f87769243edbb-delete
>   at kafka.log.LogManager.asyncDelete(LogManager.scala:439)
>   at 
> kafka.cluster.Partition$$anonfun$delete$1.apply$mcV$sp(Partition.scala:142)
>   at kafka.cluster.Partition$$anonfun$delete$1.apply(Partition.scala:137)
>   at kafka.cluster.Partition$$anonfun$delete$1.apply(Partition.scala:137)
>   at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:213)
>   at kafka.utils.CoreUtils$.inWriteLock(CoreUtils.scala:221)
>   at kafka.cluster.Partition.delete(Partition.scala:137)
>   at kafka.server.ReplicaManager.stopReplica(ReplicaManager.scala:230)
>   at 
> kafka.server.ReplicaManager$$anonfun$stopReplicas$2.apply(ReplicaManager.scala:260)
>   at 
> kafka.server.ReplicaManager$$anonfun$stopReplicas$2.apply(ReplicaManager.scala:259)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at kafka.server.ReplicaManager.stopReplicas(ReplicaManager.scala:259)
>   at kafka.server.KafkaApis.handleStopReplicaRequest(KafkaApis.scala:174)
>   at kafka.server.KafkaApis.handle(KafkaApis.scala:86)
>   at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:64)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> The topic after this point still exists but has Leader set to -1 and the 
> controller recognizes the topic completion as incomplete (the topic znode is 
> still in /admin/delete_topics).
> I don't believe linkedin has any topic name this long but I'm making the 
> ticket in case anyone runs into this problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-4893) async topic deletion conflicts with max topic length

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-4893:

Fix Version/s: (was: 2.1.0)
   2.2.0

> async topic deletion conflicts with max topic length
> 
>
> Key: KAFKA-4893
> URL: https://issues.apache.org/jira/browse/KAFKA-4893
> Project: Kafka
>  Issue Type: Bug
>Reporter: Onur Karaman
>Assignee: Vahid Hashemian
>Priority: Minor
> Fix For: 2.2.0
>
>
> As per the 
> [documentation|http://kafka.apache.org/documentation/#basic_ops_add_topic], 
> topics can be only 249 characters long to line up with typical filesystem 
> limitations:
> {quote}
> Each sharded partition log is placed into its own folder under the Kafka log 
> directory. The name of such folders consists of the topic name, appended by a 
> dash (\-) and the partition id. Since a typical folder name can not be over 
> 255 characters long, there will be a limitation on the length of topic names. 
> We assume the number of partitions will not ever be above 100,000. Therefore, 
> topic names cannot be longer than 249 characters. This leaves just enough 
> room in the folder name for a dash and a potentially 5 digit long partition 
> id.
> {quote}
> {{kafka.common.Topic.maxNameLength}} is set to 249 and is used during 
> validation.
> This limit ends up not being quite right since topic deletion ends up 
> renaming the directory to the form {{topic-partition.uniqueId-delete}} as can 
> be seen in {{LogManager.asyncDelete}}:
> {code}
> val dirName = new StringBuilder(removedLog.name)
>   .append(".")
>   
> .append(java.util.UUID.randomUUID.toString.replaceAll("-",""))
>   .append(Log.DeleteDirSuffix)
>   .toString()
> {code}
> So the unique id and "-delete" suffix end up hogging some of the characters. 
> Deleting a long-named topic results in a log message such as the following:
> {code}
> kafka.common.KafkaStorageException: Failed to rename log directory from 
> /tmp/kafka-logs0/0-0
>  to 
> /tmp/kafka-logs0/0-0.797bba3fb2464729840f87769243edbb-delete
>   at kafka.log.LogManager.asyncDelete(LogManager.scala:439)
>   at 
> kafka.cluster.Partition$$anonfun$delete$1.apply$mcV$sp(Partition.scala:142)
>   at kafka.cluster.Partition$$anonfun$delete$1.apply(Partition.scala:137)
>   at kafka.cluster.Partition$$anonfun$delete$1.apply(Partition.scala:137)
>   at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:213)
>   at kafka.utils.CoreUtils$.inWriteLock(CoreUtils.scala:221)
>   at kafka.cluster.Partition.delete(Partition.scala:137)
>   at kafka.server.ReplicaManager.stopReplica(ReplicaManager.scala:230)
>   at 
> kafka.server.ReplicaManager$$anonfun$stopReplicas$2.apply(ReplicaManager.scala:260)
>   at 
> kafka.server.ReplicaManager$$anonfun$stopReplicas$2.apply(ReplicaManager.scala:259)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at kafka.server.ReplicaManager.stopReplicas(ReplicaManager.scala:259)
>   at kafka.server.KafkaApis.handleStopReplicaRequest(KafkaApis.scala:174)
>   at kafka.server.KafkaApis.handle(KafkaApis.scala:86)
>   at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:64)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> The topic after this point still exists but has Leader set to -1 and the 
> controller recognizes the topic completion as incomplete (the topic znode is 
> still in /admin/delete_topics).
> I don't believe linkedin has any topic name this long but I'm making the 
> ticket in case anyone runs into this problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-6951) Implement offset expiration semantics for unsubscribed topics

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-6951:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Implement offset expiration semantics for unsubscribed topics
> -
>
> Key: KAFKA-6951
> URL: https://issues.apache.org/jira/browse/KAFKA-6951
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Reporter: Vahid Hashemian
>Assignee: Vahid Hashemian
>Priority: Major
> Fix For: 2.2.0
>
>
> [This 
> portion|https://cwiki.apache.org/confluence/display/KAFKA/KIP-211%3A+Revise+Expiration+Semantics+of+Consumer+Group+Offsets#KIP-211:ReviseExpirationSemanticsofConsumerGroupOffsets-UnsubscribingfromaTopic]
>  of KIP-211 will be implemented separately from the main PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6951) Implement offset expiration semantics for unsubscribed topics

2018-10-02 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636523#comment-16636523
 ] 

Dong Lin commented on KAFKA-6951:
-

Move to 2.2.0 since PR is not ready.

> Implement offset expiration semantics for unsubscribed topics
> -
>
> Key: KAFKA-6951
> URL: https://issues.apache.org/jira/browse/KAFKA-6951
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Reporter: Vahid Hashemian
>Assignee: Vahid Hashemian
>Priority: Major
> Fix For: 2.2.0
>
>
> [This 
> portion|https://cwiki.apache.org/confluence/display/KAFKA/KIP-211%3A+Revise+Expiration+Semantics+of+Consumer+Group+Offsets#KIP-211:ReviseExpirationSemanticsofConsumerGroupOffsets-UnsubscribingfromaTopic]
>  of KIP-211 will be implemented separately from the main PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-3143) inconsistent state in ZK when all replicas are dead

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-3143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-3143:

Fix Version/s: (was: 2.1.0)
   2.2.0

> inconsistent state in ZK when all replicas are dead
> ---
>
> Key: KAFKA-3143
> URL: https://issues.apache.org/jira/browse/KAFKA-3143
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jun Rao
>Assignee: Ismael Juma
>Priority: Major
>  Labels: reliability
> Fix For: 2.2.0
>
>
> This issue can be recreated in the following steps.
> 1. Start 3 brokers, 1, 2 and 3.
> 2. Create a topic with a single partition and 2 replicas, say on broker 1 and 
> 2.
> If we stop both replicas 1 and 2, depending on where the controller is, the 
> leader and isr stored in ZK in the end are different.
> If the controller is on broker 3, what's stored in ZK will be -1 for leader 
> and an empty set for ISR.
> On the other hand, if the controller is on broker 2 and we stop broker 1 
> followed by broker 2, what's stored in ZK will be 2 for leader and 2 for ISR.
> The issue is that in the first case, the controller will call 
> ReplicaStateMachine to transition to OfflineReplica, which will change the 
> leader and isr. However, in the second case, the controller fails over, but 
> we don't transition ReplicaStateMachine to OfflineReplica during controller 
> initialization.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-4133) Provide a configuration to control consumer max in-flight fetches

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-4133:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Provide a configuration to control consumer max in-flight fetches
> -
>
> Key: KAFKA-4133
> URL: https://issues.apache.org/jira/browse/KAFKA-4133
> Project: Kafka
>  Issue Type: Improvement
>  Components: consumer
>Reporter: Jason Gustafson
>Assignee: Mickael Maison
>Priority: Major
> Fix For: 2.2.0
>
>
> With KIP-74, we now have a good way to limit the size of fetch responses, but 
> it may still be difficult for users to control overall memory since the 
> consumer will send fetches in parallel to all the brokers which own 
> partitions that it is subscribed to. To give users finer control, it might 
> make sense to add a `max.in.flight.fetches` setting to limit the total number 
> of concurrent fetches at any time. This would require a KIP since it's a new 
> configuration.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5795) Make the idempotent producer the default producer setting

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5795:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Make the idempotent producer the default producer setting
> -
>
> Key: KAFKA-5795
> URL: https://issues.apache.org/jira/browse/KAFKA-5795
> Project: Kafka
>  Issue Type: New Feature
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Major
> Fix For: 2.2.0
>
>
> We would like to turn on idempotence by default. The KIP is here: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-185%3A+Make+exactly+once+in+order+delivery+per+partition+the+default+producer+setting



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5284) Add tools and metrics to diagnose problems with the idempotent producer and transactions

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5284:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Add tools and metrics to diagnose problems with the idempotent producer and 
> transactions
> 
>
> Key: KAFKA-5284
> URL: https://issues.apache.org/jira/browse/KAFKA-5284
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Apurva Mehta
>Priority: Major
>  Labels: exactly-once
> Fix For: 2.2.0
>
>
> The KIP mentions a number of metrics which we should add, but haven't yet 
> done so. IT would also be good to have tools to help diagnose degenerate 
> situations like:
> # If a consumer is stuck, we should be able to find the LSO of the partition 
> it is blocked on, and which producer is holding up the advancement of the LSO.
> # We should be able to force abort any inflight transaction to free up 
> consumers
> # etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-6509) Add additional tests for validating store restoration completes before Topology is intitalized

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-6509:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Add additional tests for validating store restoration completes before 
> Topology is intitalized
> --
>
> Key: KAFKA-6509
> URL: https://issues.apache.org/jira/browse/KAFKA-6509
> Project: Kafka
>  Issue Type: Test
>  Components: streams
>Reporter: Bill Bejeck
>Priority: Major
>  Labels: newbie
> Fix For: 2.2.0
>
>
> Right now 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-6463) Review logging level for user errors in AdminManager

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-6463:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Review logging level for user errors in AdminManager
> 
>
> Key: KAFKA-6463
> URL: https://issues.apache.org/jira/browse/KAFKA-6463
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Rajini Sivaram
>Priority: Major
> Fix For: 2.2.0
>
>
> AdminManager currently logs errors due to bad requests at INFO level (e.g. 
> alter configs with bad value). In other components, I think we only log user 
> errors are either not logged or logged at a lower logging level. We should 
> review logging in AdminManager.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-5284) Add tools and metrics to diagnose problems with the idempotent producer and transactions

2018-10-02 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-5284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636520#comment-16636520
 ] 

Dong Lin commented on KAFKA-5284:
-

Move to 2.2.0 since PR is not ready.

 

> Add tools and metrics to diagnose problems with the idempotent producer and 
> transactions
> 
>
> Key: KAFKA-5284
> URL: https://issues.apache.org/jira/browse/KAFKA-5284
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Apurva Mehta
>Priority: Major
>  Labels: exactly-once
> Fix For: 2.1.0
>
>
> The KIP mentions a number of metrics which we should add, but haven't yet 
> done so. IT would also be good to have tools to help diagnose degenerate 
> situations like:
> # If a consumer is stuck, we should be able to find the LSO of the partition 
> it is blocked on, and which producer is holding up the advancement of the LSO.
> # We should be able to force abort any inflight transaction to free up 
> consumers
> # etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6463) Review logging level for user errors in AdminManager

2018-10-02 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636519#comment-16636519
 ] 

Dong Lin commented on KAFKA-6463:
-

Move to 2.2.0 since PR is not ready.

> Review logging level for user errors in AdminManager
> 
>
> Key: KAFKA-6463
> URL: https://issues.apache.org/jira/browse/KAFKA-6463
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Rajini Sivaram
>Priority: Major
> Fix For: 2.2.0
>
>
> AdminManager currently logs errors due to bad requests at INFO level (e.g. 
> alter configs with bad value). In other components, I think we only log user 
> errors are either not logged or logged at a lower logging level. We should 
> review logging in AdminManager.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5606) Review consumer's RequestFuture usage pattern

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5606:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Review consumer's RequestFuture usage pattern
> -
>
> Key: KAFKA-5606
> URL: https://issues.apache.org/jira/browse/KAFKA-5606
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ismael Juma
>Assignee: james chien
>Priority: Major
> Fix For: 2.2.0
>
>
> KAFKA-5556 shows that we can perhaps tighten the usage pattern of the 
> consumer's RequestFuture to avoid similar bugs in the future.
> Jason suggested:
> {quote}
> Another way to see this bug is a failure to ensure completion of the future. 
> Had we done so, then we could have skipped the failed check. This is why it 
> worked prior to the patch which added the timeout. The pattern should really 
> be something like this:
> {code}
> if (future.isDone()) {
>   if (future.succeeded()) {
> // handle success
>   } else {
> // handle failure
>   }
> }
> {code}
> I guess one benefit of the enum approach is that it forces you to ensure 
> completion prior to checking any of the possible results. That said, I'm a 
> bit more inclined to remove the isRetriable method and leave it to the caller 
> to determine what is and is not retriable. Then the request future only has 
> two completion states.
> {quote}
> An alternative is replacing succeeded and failed with a status method 
> returning an enum with 3 states: SUCCEEDED, FAILED, RETRY (the enum approach 
> mentioned above). This would make sense if we often have to handle these 3 
> states differently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5780) Long shutdown time when updated to 0.11.0

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5780:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Long shutdown time when updated to 0.11.0
> -
>
> Key: KAFKA-5780
> URL: https://issues.apache.org/jira/browse/KAFKA-5780
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.11.0.0
> Environment: CentOS Linux release 7.3.1611 , Kernel 3.10
>Reporter: Raoufeh Hashemian
>Assignee: Apurva Mehta
>Priority: Major
> Fix For: 2.2.0
>
> Attachments: broker_shutdown.png, shutdown.log, 
> shutdown_controller.log, shutdown_statechange.log
>
>
> When we switched from Kafka 0.10.2 to Kafka 0.11.0 , We faced a problem with 
> stopping the kafka service on a broker node.
> Our cluster consists of 6 broker nodes. We had an existing topic when 
> switched to Kafka 0.11.0 . Since then, gracefully stoping the service on a 
> Kafka broker node results in the following warning message being repeated 
> every 100 ms in the broker log, and the shutdown takes approximately 45 
> minutes to complete.
> {code:java}
> @4000599714da1e582e4c [2017-08-18 16:24:48,509] WARN Connection to node 
> 1002 could not be established. Broker may not be available. 
> (org.apache.kafka.clients.NetworkClient)
> @4000599714da245483a4 [2017-08-18 16:24:48,609] WARN Connection to node 
> 1002 could not be established. Broker may not be available. 
> (org.apache.kafka.clients.NetworkClient)
> @4000599714da2a51177c [2017-08-18 16:24:48,709] WARN Connection to node 
> 1002 could not be established. Broker may not be available. 
> (org.apache.kafka.clients.NetworkClient)
> {code}
> Below is the last log lines when the shutdown is complete :
> {code:java}
> @400059971afd31113dbc [2017-08-18 16:50:59,823] WARN Connection to node 
> 1002 could not be established. Broker may not be available. 
> (org.apache.kafka.clients.NetworkClient)
> @400059971afd361200bc [2017-08-18 16:50:59,907] INFO Shutdown complete. 
> (kafka.log.LogManager)
> @400059971afd36afa04c [2017-08-18 16:50:59,917] INFO Terminate ZkClient 
> event thread. (org.I0Itec.zkclient.ZkEventThread)
> @400059971afd36dd6edc [2017-08-18 16:50:59,920] INFO Session: 
> 0x35d68c9e76702a4 closed (org.apache.zookeeper.ZooKeeper)
> @400059971afd36deca84 [2017-08-18 16:50:59,920] INFO EventThread shut 
> down for session: 0x35d68c9e76702a4 (org.apache.zookeeper.ClientCnxn)
> @400059971afd36f6afb4 [2017-08-18 16:50:59,922] INFO [Kafka Server 1002], 
> shut down completed (kafka.server.KafkaServer)
> {code}
> I should note that I stopped the producers before shutting down the broker.
> If I repeat the process after brining up the service, the shutdown takes less 
> than a minute. However, if I start the producers even for a short time and 
> repeat the process, it will again take around 45 minutes to do a graceful 
> shutdown.
> Attached files shows the brokers CPU usage during the shutdown period (light 
> blue curve is the node in which the broker service is shutting down).
> The size of the topic is 2.3 TB per broker.
> I was wondering if this is an expected behaviour or It is a bug or a 
> misconfiguration? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-5780) Long shutdown time when updated to 0.11.0

2018-10-02 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-5780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636518#comment-16636518
 ] 

Dong Lin commented on KAFKA-5780:
-

Move to 2.2.0 since PR is not ready.

> Long shutdown time when updated to 0.11.0
> -
>
> Key: KAFKA-5780
> URL: https://issues.apache.org/jira/browse/KAFKA-5780
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.11.0.0
> Environment: CentOS Linux release 7.3.1611 , Kernel 3.10
>Reporter: Raoufeh Hashemian
>Assignee: Apurva Mehta
>Priority: Major
> Fix For: 2.2.0
>
> Attachments: broker_shutdown.png, shutdown.log, 
> shutdown_controller.log, shutdown_statechange.log
>
>
> When we switched from Kafka 0.10.2 to Kafka 0.11.0 , We faced a problem with 
> stopping the kafka service on a broker node.
> Our cluster consists of 6 broker nodes. We had an existing topic when 
> switched to Kafka 0.11.0 . Since then, gracefully stoping the service on a 
> Kafka broker node results in the following warning message being repeated 
> every 100 ms in the broker log, and the shutdown takes approximately 45 
> minutes to complete.
> {code:java}
> @4000599714da1e582e4c [2017-08-18 16:24:48,509] WARN Connection to node 
> 1002 could not be established. Broker may not be available. 
> (org.apache.kafka.clients.NetworkClient)
> @4000599714da245483a4 [2017-08-18 16:24:48,609] WARN Connection to node 
> 1002 could not be established. Broker may not be available. 
> (org.apache.kafka.clients.NetworkClient)
> @4000599714da2a51177c [2017-08-18 16:24:48,709] WARN Connection to node 
> 1002 could not be established. Broker may not be available. 
> (org.apache.kafka.clients.NetworkClient)
> {code}
> Below is the last log lines when the shutdown is complete :
> {code:java}
> @400059971afd31113dbc [2017-08-18 16:50:59,823] WARN Connection to node 
> 1002 could not be established. Broker may not be available. 
> (org.apache.kafka.clients.NetworkClient)
> @400059971afd361200bc [2017-08-18 16:50:59,907] INFO Shutdown complete. 
> (kafka.log.LogManager)
> @400059971afd36afa04c [2017-08-18 16:50:59,917] INFO Terminate ZkClient 
> event thread. (org.I0Itec.zkclient.ZkEventThread)
> @400059971afd36dd6edc [2017-08-18 16:50:59,920] INFO Session: 
> 0x35d68c9e76702a4 closed (org.apache.zookeeper.ZooKeeper)
> @400059971afd36deca84 [2017-08-18 16:50:59,920] INFO EventThread shut 
> down for session: 0x35d68c9e76702a4 (org.apache.zookeeper.ClientCnxn)
> @400059971afd36f6afb4 [2017-08-18 16:50:59,922] INFO [Kafka Server 1002], 
> shut down completed (kafka.server.KafkaServer)
> {code}
> I should note that I stopped the producers before shutting down the broker.
> If I repeat the process after brining up the service, the shutdown takes less 
> than a minute. However, if I start the producers even for a short time and 
> repeat the process, it will again take around 45 minutes to do a graceful 
> shutdown.
> Attached files shows the brokers CPU usage during the shutdown period (light 
> blue curve is the node in which the broker service is shutting down).
> The size of the topic is 2.3 TB per broker.
> I was wondering if this is an expected behaviour or It is a bug or a 
> misconfiguration? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5482) A CONCURRENT_TRANASCTIONS error for the first AddPartitionsToTxn request slows down transactions significantly

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5482:

Fix Version/s: (was: 2.1.0)
   2.2.0

> A CONCURRENT_TRANASCTIONS error for the first AddPartitionsToTxn request 
> slows down transactions significantly
> --
>
> Key: KAFKA-5482
> URL: https://issues.apache.org/jira/browse/KAFKA-5482
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.0
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Major
> Fix For: 2.2.0
>
>
> Here is the issue.
> # When we do a commit transaction, the producer sends an `EndTxn` request to 
> the coordinator. The coordinator writes the `PrepareCommit` message to the 
> transaction log and then returns the response the client. It writes the 
> transaction markers and the final 'CompleteCommit' message asynchronously. 
> # In the mean time, if the client starts another transaction, it will send an 
> `AddPartitions` request on the next `Sender.run` loop. If the markers haven't 
> been written yet, then the coordinator will return a retriable 
> `CONCURRENT_TRANSACTIONS` error to the client.
> # The current behavior in the producer is to sleep for `retryBackoffMs` 
> before retrying the request. The current default for this is 100ms. So the 
> producer will sleep for 100ms before sending the `AddPartitions` again. This 
> puts a floor on the latency for back to back transactions.
> This has been worked around in 
> https://issues.apache.org/jira/browse/KAFKA-5477 by reducing the retryBackoff 
> for the first AddPartitions request. But we need a stronger solution: like 
> having the commit block until the transaction is complete, or delaying the 
> addPartitions until batches are actually ready to be sent to the transaction.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-5356) Producer with transactionalId should be able to send outside of a transaction

2018-10-02 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-5356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636515#comment-16636515
 ] 

Dong Lin commented on KAFKA-5356:
-

Move to 2.2.0 since PR is not ready.

> Producer with transactionalId should be able to send outside of a transaction
> -
>
> Key: KAFKA-5356
> URL: https://issues.apache.org/jira/browse/KAFKA-5356
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Jason Gustafson
>Priority: Major
> Fix For: 2.2.0
>
>
> This is debatable, but it may be useful to allow a producer to send 
> non-trasactional data even if they have a transactionalId set. At the moment, 
> this is explicitly forbidden.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5356) Producer with transactionalId should be able to send outside of a transaction

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5356:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Producer with transactionalId should be able to send outside of a transaction
> -
>
> Key: KAFKA-5356
> URL: https://issues.apache.org/jira/browse/KAFKA-5356
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Jason Gustafson
>Priority: Major
> Fix For: 2.2.0
>
>
> This is debatable, but it may be useful to allow a producer to send 
> non-trasactional data even if they have a transactionalId set. At the moment, 
> this is explicitly forbidden.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-5482) A CONCURRENT_TRANASCTIONS error for the first AddPartitionsToTxn request slows down transactions significantly

2018-10-02 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-5482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636514#comment-16636514
 ] 

Dong Lin commented on KAFKA-5482:
-

Move to 2.2.0 since PR is not ready.

> A CONCURRENT_TRANASCTIONS error for the first AddPartitionsToTxn request 
> slows down transactions significantly
> --
>
> Key: KAFKA-5482
> URL: https://issues.apache.org/jira/browse/KAFKA-5482
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.0
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Major
> Fix For: 2.2.0
>
>
> Here is the issue.
> # When we do a commit transaction, the producer sends an `EndTxn` request to 
> the coordinator. The coordinator writes the `PrepareCommit` message to the 
> transaction log and then returns the response the client. It writes the 
> transaction markers and the final 'CompleteCommit' message asynchronously. 
> # In the mean time, if the client starts another transaction, it will send an 
> `AddPartitions` request on the next `Sender.run` loop. If the markers haven't 
> been written yet, then the coordinator will return a retriable 
> `CONCURRENT_TRANSACTIONS` error to the client.
> # The current behavior in the producer is to sleep for `retryBackoffMs` 
> before retrying the request. The current default for this is 100ms. So the 
> producer will sleep for 100ms before sending the `AddPartitions` again. This 
> puts a floor on the latency for back to back transactions.
> This has been worked around in 
> https://issues.apache.org/jira/browse/KAFKA-5477 by reducing the retryBackoff 
> for the first AddPartitions request. But we need a stronger solution: like 
> having the commit block until the transaction is complete, or delaying the 
> addPartitions until batches are actually ready to be sent to the transaction.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5870) Idempotent producer: a producerId reset causes undesirable behavior for inflight batches to other partitions

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5870:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Idempotent producer: a producerId reset causes undesirable behavior for 
> inflight batches to other partitions
> 
>
> Key: KAFKA-5870
> URL: https://issues.apache.org/jira/browse/KAFKA-5870
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.0
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Major
>  Labels: exactly-once
> Fix For: 2.2.0
>
>
> Currently, if we have to reset the producer id for any reason (for instance 
> if batches to a partition get expired, if we get an 
> {{OutOfOrderSequenceException}}, etc) we could cause batches to other 
> --healthy-- partitions to fail with a spurious 
> {{OutOfOrderSequenceException}}.
> This is detailed in this PR discussion: 
> https://github.com/apache/kafka/pull/3743#discussion_r137907630
> Ideally, we would want all inflight batches to be handled to completion 
> rather than potentially failing them prematurely. Further, since we want to 
> tighten up the semantics of the {{OutOfOrderSequenceException}}, at the very 
> least we should raise another exception in this case, because there is no 
> data loss on the broker when the client gives up. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-5870) Idempotent producer: a producerId reset causes undesirable behavior for inflight batches to other partitions

2018-10-02 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-5870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16635781#comment-16635781
 ] 

Dong Lin commented on KAFKA-5870:
-

Moving this to 2.2.0 since PR is not ready yet.

> Idempotent producer: a producerId reset causes undesirable behavior for 
> inflight batches to other partitions
> 
>
> Key: KAFKA-5870
> URL: https://issues.apache.org/jira/browse/KAFKA-5870
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.0
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Major
>  Labels: exactly-once
> Fix For: 2.2.0
>
>
> Currently, if we have to reset the producer id for any reason (for instance 
> if batches to a partition get expired, if we get an 
> {{OutOfOrderSequenceException}}, etc) we could cause batches to other 
> --healthy-- partitions to fail with a spurious 
> {{OutOfOrderSequenceException}}.
> This is detailed in this PR discussion: 
> https://github.com/apache/kafka/pull/3743#discussion_r137907630
> Ideally, we would want all inflight batches to be handled to completion 
> rather than potentially failing them prematurely. Further, since we want to 
> tighten up the semantics of the {{OutOfOrderSequenceException}}, at the very 
> least we should raise another exception in this case, because there is no 
> data loss on the broker when the client gives up. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5116) Controller updates to ISR holds the controller lock for a very long time

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5116:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Controller updates to ISR holds the controller lock for a very long time
> 
>
> Key: KAFKA-5116
> URL: https://issues.apache.org/jira/browse/KAFKA-5116
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.10.1.0, 0.10.2.0
>Reporter: Justin Downing
>Priority: Major
> Fix For: 2.2.0
>
>
> Hello!
> Lately, we have noticed slow (or no) results when monitoring the broker's ISR 
> using JMX. Many of these requests appear to be 'hung' for a very long time 
> (eg: >2m). We've dug a bunch, and found that in our case, sometimes the 
> controllerLock can be held for multiple minutes in the IsrChangeNotifier 
> callback.
> Inside the lock, we are reading from Zookeeper for *each* partition in the 
> changeset. With a large changeset (eg: >500 partitions), [this 
> operation|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/controller/KafkaController.scala#L1347]
>  can take a long time to complete. 
> In KAFKA-2406, throttling was introduced to prevent overwhelming the 
> controller with many changesets at once. However, this does not take into 
> consideration _large_ changesets.
> We have identified two potential remediations we'd like to discuss further:
> * Move the Zookeeper request outside of the lock. This would then only lock 
> for the controller update and processing of the changeset.
> * Send limited changesets to Zookeeper when calling the 
> maybePropagateIsrChanges. When dealing with lots of partitions (eg: >1000) it 
> may be useful to batch the changesets in groups of 100 rather the send the 
> [entire 
> list|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/ReplicaManager.scala#L204]
>  to Zookeeper at once.
> We're happy working on patches for either or both of these, but we are unsure 
> of the safety around these two proposals. Specifically, moving the Zookeeper 
> request out of the lock may be unsafe.
> Holding these locks for long periods of time seems problematic - it means 
> that broker failure won't be detected and acted upon quickly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-5116) Controller updates to ISR holds the controller lock for a very long time

2018-10-02 Thread Dong Lin (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-5116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16635780#comment-16635780
 ] 

Dong Lin commented on KAFKA-5116:
-

Moving this to 2.2.0 since PR is not ready yet.

> Controller updates to ISR holds the controller lock for a very long time
> 
>
> Key: KAFKA-5116
> URL: https://issues.apache.org/jira/browse/KAFKA-5116
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.10.1.0, 0.10.2.0
>Reporter: Justin Downing
>Priority: Major
> Fix For: 2.2.0
>
>
> Hello!
> Lately, we have noticed slow (or no) results when monitoring the broker's ISR 
> using JMX. Many of these requests appear to be 'hung' for a very long time 
> (eg: >2m). We've dug a bunch, and found that in our case, sometimes the 
> controllerLock can be held for multiple minutes in the IsrChangeNotifier 
> callback.
> Inside the lock, we are reading from Zookeeper for *each* partition in the 
> changeset. With a large changeset (eg: >500 partitions), [this 
> operation|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/controller/KafkaController.scala#L1347]
>  can take a long time to complete. 
> In KAFKA-2406, throttling was introduced to prevent overwhelming the 
> controller with many changesets at once. However, this does not take into 
> consideration _large_ changesets.
> We have identified two potential remediations we'd like to discuss further:
> * Move the Zookeeper request outside of the lock. This would then only lock 
> for the controller update and processing of the changeset.
> * Send limited changesets to Zookeeper when calling the 
> maybePropagateIsrChanges. When dealing with lots of partitions (eg: >1000) it 
> may be useful to batch the changesets in groups of 100 rather the send the 
> [entire 
> list|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/ReplicaManager.scala#L204]
>  to Zookeeper at once.
> We're happy working on patches for either or both of these, but we are unsure 
> of the safety around these two proposals. Specifically, moving the Zookeeper 
> request out of the lock may be unsafe.
> Holding these locks for long periods of time seems problematic - it means 
> that broker failure won't be detected and acted upon quickly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5453) Controller may miss requests sent to the broker when zk session timeout happens.

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5453:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Controller may miss requests sent to the broker when zk session timeout 
> happens.
> 
>
> Key: KAFKA-5453
> URL: https://issues.apache.org/jira/browse/KAFKA-5453
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.11.0.0
>Reporter: Jiangjie Qin
>Priority: Major
> Fix For: 2.2.0
>
>
> The issue I encountered was the following:
> 1. Partition reassignment was in progress, one replica of a partition is 
> being reassigned from broker 1 to broker 2.
> 2. Controller received an ISR change notification which indicates broker 2 
> has caught up.
> 3. Controller was sending StopReplicaRequest to broker 1.
> 4. Broker 1 zk session timeout occurs. Controller removed broker 1 from the 
> cluster and cleaned up the queue. i.e. the StopReplicaRequest was removed 
> from the ControllerChannelManager.
> 5. Broker 1 reconnected to zk and act as if it is still a follower replica of 
> the partition. 
> 6. Broker 1 will always receive exception from the leader because it is not 
> in the replica list.
> Not sure what is the correct fix here. It seems that broke 1 in this case 
> should ask the controller for the latest replica assignment.
> There are two related bugs:
> 1. when a {{NotAssignedReplicaException}} is thrown from 
> {{Partition.updateReplicaLogReadResult()}}, the other partitions in the same 
> request will failed to update the fetch timestamp and offset and thus also 
> drop out of the ISR.
> 2. The {{NotAssignedReplicaException}} was not properly returned to the 
> replicas, instead, a UnknownServerException is returned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5453) Controller may miss requests sent to the broker when zk session timeout happens.

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5453:

Fix Version/s: (was: 2.2.0)
   2.1.0

> Controller may miss requests sent to the broker when zk session timeout 
> happens.
> 
>
> Key: KAFKA-5453
> URL: https://issues.apache.org/jira/browse/KAFKA-5453
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.11.0.0
>Reporter: Jiangjie Qin
>Priority: Major
> Fix For: 2.1.0
>
>
> The issue I encountered was the following:
> 1. Partition reassignment was in progress, one replica of a partition is 
> being reassigned from broker 1 to broker 2.
> 2. Controller received an ISR change notification which indicates broker 2 
> has caught up.
> 3. Controller was sending StopReplicaRequest to broker 1.
> 4. Broker 1 zk session timeout occurs. Controller removed broker 1 from the 
> cluster and cleaned up the queue. i.e. the StopReplicaRequest was removed 
> from the ControllerChannelManager.
> 5. Broker 1 reconnected to zk and act as if it is still a follower replica of 
> the partition. 
> 6. Broker 1 will always receive exception from the leader because it is not 
> in the replica list.
> Not sure what is the correct fix here. It seems that broke 1 in this case 
> should ask the controller for the latest replica assignment.
> There are two related bugs:
> 1. when a {{NotAssignedReplicaException}} is thrown from 
> {{Partition.updateReplicaLogReadResult()}}, the other partitions in the same 
> request will failed to update the fetch timestamp and offset and thus also 
> drop out of the ISR.
> 2. The {{NotAssignedReplicaException}} was not properly returned to the 
> replicas, instead, a UnknownServerException is returned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5601) Refactor ReassignPartitionsCommand to use AdminClient

2018-10-02 Thread Dong Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated KAFKA-5601:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Refactor ReassignPartitionsCommand to use AdminClient
> -
>
> Key: KAFKA-5601
> URL: https://issues.apache.org/jira/browse/KAFKA-5601
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Tom Bentley
>Assignee: Tom Bentley
>Priority: Major
>  Labels: kip
> Fix For: 2.2.0
>
>
> Currently the {{ReassignPartitionsCommand}} (used by 
> {{kafka-reassign-partitions.sh}}) talks directly to ZooKeeper. It would be 
> better to have it use the AdminClient API instead. 
> This would entail creating two new protocol APIs, one to initiate the request 
> and another to request the status of an in-progress reassignment. As such 
> this would require a KIP.
> This touches on the work of KIP-166, but that proposes to use the 
> {{ReassignPartitionsCommand}} API, so should not be affected so long as that 
> API is maintained. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


<    1   2   3   4   5   6   7   >