[jira] [Updated] (KAFKA-13965) Document broker-side socket-server-metrics

2022-06-07 Thread James Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Cheng updated KAFKA-13965:

Labels: newbie newbie++  (was: )

> Document broker-side socket-server-metrics
> --
>
> Key: KAFKA-13965
> URL: https://issues.apache.org/jira/browse/KAFKA-13965
> Project: Kafka
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 3.2.0
>Reporter: James Cheng
>Priority: Major
>  Labels: newbie, newbie++
>
> There are a bunch of broker JMX metrics in the "socket-server-metrics" space 
> that are not documented on kafka.apache.org/documentation
>  
>  * _MBean_: 
> kafka.server:type=socket-server-metrics,listener=,networkProcessor=
>  ** From KIP-188: 
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-188+-+Add+new+metrics+to+support+health+checks]
>  *  
> kafka.server:type=socket-server-metrics,name=connection-accept-rate,listener={listenerName}
>  ** From KIP-612: 
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-612%3A+Ability+to+Limit+Connection+Creation+Rate+on+Brokers]
> It would be helpful to get all the socket-server-metrics documented
>  
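For whoever picks this up: a quick way to enumerate exactly which socket-server-metrics MBeans and attributes a broker exposes (so they can be copied into the monitoring table in the docs) is to query them over JMX. The sketch below is only illustrative; the JMX port is an assumption about how the broker was started, not something defined by this ticket.
{code:java}
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ListSocketServerMetrics {
    public static void main(String[] args) throws Exception {
        // Assumes the broker was started with remote JMX enabled, e.g. JMX_PORT=9999.
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // Pattern matches every MBean in the socket-server-metrics space,
            // regardless of which tags (listener, networkProcessor, ...) it carries.
            ObjectName pattern = new ObjectName("kafka.server:type=socket-server-metrics,*");
            for (ObjectName name : mbsc.queryNames(pattern, null)) {
                System.out.println(name);
                for (MBeanAttributeInfo attr : mbsc.getMBeanInfo(name).getAttributes()) {
                    System.out.printf("  %s = %s%n",
                        attr.getName(), mbsc.getAttribute(name, attr.getName()));
                }
            }
        }
    }
}
{code}
The attribute names printed this way could then be pasted into the metrics table on kafka.apache.org/documentation together with a short description from the relevant KIP.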



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (KAFKA-13965) Document broker-side socket-server-metrics

2022-06-07 Thread James Cheng (Jira)
James Cheng created KAFKA-13965:
---

 Summary: Document broker-side socket-server-metrics
 Key: KAFKA-13965
 URL: https://issues.apache.org/jira/browse/KAFKA-13965
 Project: Kafka
  Issue Type: Improvement
  Components: documentation
Affects Versions: 3.2.0
Reporter: James Cheng


There are a bunch of broker JMX metrics in the "socket-server-metrics" space 
that are not documented on kafka.apache.org/documentation

 
>  * _MBean_: 
> kafka.server:type=socket-server-metrics,listener=,networkProcessor=
 ** From KIP-188: 
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-188+-+Add+new+metrics+to+support+health+checks]
 *  
> kafka.server:type=socket-server-metrics,name=connection-accept-rate,listener={listenerName}
 ** From KIP-612: 
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-612%3A+Ability+to+Limit+Connection+Creation+Rate+on+Brokers]

It would be helpful to get all the socket-server-metrics documented

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (KAFKA-9062) Handle stalled writes to RocksDB

2021-10-21 Thread James Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432797#comment-17432797
 ] 

James Cheng commented on KAFKA-9062:


[~vvcephei], you said that in 2.6, you removed "bulk loading". Is it this JIRA? 
https://issues.apache.org/jira/browse/KAFKA-10005

> Handle stalled writes to RocksDB
> 
>
> Key: KAFKA-9062
> URL: https://issues.apache.org/jira/browse/KAFKA-9062
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Priority: Major
>
> RocksDB may stall writes at times when background compactions or flushes are 
> having trouble keeping up. This means we can effectively end up blocking 
> indefinitely during a StateStore#put call within Streams, and may get kicked 
> from the group if the throttling does not ease up within the max poll 
> interval.
> Example: when restoring large amounts of state from scratch, we use the 
> strategy recommended by RocksDB of turning off automatic compactions and 
> dumping everything into L0. We do batch somewhat, but do not sort these small 
> batches before loading into the db, so we end up with a large number of 
> unsorted L0 files.
> When restoration is complete and we toggle the db back to normal (not bulk 
> loading) settings, a background compaction is triggered to merge all these 
> into the next level. This background compaction can take a long time to merge 
> unsorted keys, especially when the amount of data is quite large.
> Any new writes while the number of L0 files exceeds the max will be stalled 
> until the compaction can finish, and processing after restoring from scratch 
> can block beyond the polling interval
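For anyone trying to reproduce this outside of Streams, here is a minimal sketch of the bulk-loading pattern described above, written against the plain RocksDB Java API (this is not the actual Streams restore code; the path and key layout are made up for illustration). Automatic compactions are disabled while small, unsorted batches are dumped into L0, and the expensive merge only happens once compactions are back on, which is when new writes can stall.
{code:java}
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.WriteBatch;
import org.rocksdb.WriteOptions;

public class BulkLoadSketch {
    public static void main(String[] args) throws Exception {
        RocksDB.loadLibrary();

        // Phase 1: "bulk load" mode -- no automatic compactions, everything lands in L0.
        try (Options bulk = new Options()
                 .setCreateIfMissing(true)
                 .setDisableAutoCompactions(true);
             RocksDB db = RocksDB.open(bulk, "/tmp/bulk-load-demo");
             WriteOptions wo = new WriteOptions()) {
            for (int b = 0; b < 1000; b++) {
                try (WriteBatch batch = new WriteBatch()) {
                    for (int i = 0; i < 1000; i++) {
                        // Keys are deliberately not in sorted order within or across batches.
                        String key = "key-" + ((b * 7919 + i * 31) % 1_000_000);
                        batch.put(key.getBytes(), ("value-" + i).getBytes());
                    }
                    db.write(wo, batch); // each small, unsorted batch piles up as L0 data
                }
            }
        }

        // Phase 2: reopen with compactions enabled; RocksDB now has to merge the
        // accumulated L0 files, and new writes can stall until that finishes.
        try (Options normal = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(normal, "/tmp/bulk-load-demo")) {
            db.compactRange(); // force the big merge up front instead of stalling later
        }
    }
}
{code}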



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9062) Handle stalled writes to RocksDB

2021-10-21 Thread James Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432794#comment-17432794
 ] 

James Cheng commented on KAFKA-9062:


Are there any metrics that we can measure and look at, to see if we are being 
impacted by this issue? Anything in RocksDB or Kafka Streams? Maybe put latency? 
Thanks.

> Handle stalled writes to RocksDB
> 
>
> Key: KAFKA-9062
> URL: https://issues.apache.org/jira/browse/KAFKA-9062
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Priority: Major
>
> RocksDB may stall writes at times when background compactions or flushes are 
> having trouble keeping up. This means we can effectively end up blocking 
> indefinitely during a StateStore#put call within Streams, and may get kicked 
> from the group if the throttling does not ease up within the max poll 
> interval.
> Example: when restoring large amounts of state from scratch, we use the 
> strategy recommended by RocksDB of turning off automatic compactions and 
> dumping everything into L0. We do batch somewhat, but do not sort these small 
> batches before loading into the db, so we end up with a large number of 
> unsorted L0 files.
> When restoration is complete and we toggle the db back to normal (not bulk 
> loading) settings, a background compaction is triggered to merge all these 
> into the next level. This background compaction can take a long time to merge 
> unsorted keys, especially when the amount of data is quite large.
> Any new writes while the number of L0 files exceeds the max will be stalled 
> until the compaction can finish, and processing after restoring from scratch 
> can block beyond the polling interval



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-10232) MirrorMaker2 internal topics Formatters

2021-02-17 Thread James Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-10232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17286270#comment-17286270
 ] 

James Cheng commented on KAFKA-10232:
-

This was implemented in https://github.com/apache/kafka/pull/8604

> MirrorMaker2 internal topics Formatters
> ---
>
> Key: KAFKA-10232
> URL: https://issues.apache.org/jira/browse/KAFKA-10232
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect, mirrormaker
>Reporter: Mickael Maison
>Assignee: Mickael Maison
>Priority: Major
> Fix For: 2.7.0
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-597%3A+MirrorMaker2+internal+topics+Formatters



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8522) Tombstones can survive forever

2021-01-15 Thread James Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266396#comment-17266396
 ] 

James Cheng commented on KAFKA-8522:


The pull request has been moved to [https://github.com/apache/kafka/pull/9915]

 

> Tombstones can survive forever
> --
>
> Key: KAFKA-8522
> URL: https://issues.apache.org/jira/browse/KAFKA-8522
> Project: Kafka
>  Issue Type: Improvement
>  Components: log cleaner
>Reporter: Evelyn Bayes
>Priority: Minor
>
> This is a bit of a grey zone as to whether it's a "bug", but it is certainly 
> unintended behaviour.
>  
> Under specific conditions tombstones effectively survive forever:
>  * Small amount of throughput;
>  * min.cleanable.dirty.ratio near or at 0; and
>  * Other parameters at default.
> What happens is that all the data continuously gets cycled into the oldest 
> segment. Old records get compacted away, but the new records continuously 
> update the timestamp of the oldest segment, resetting the countdown for 
> deleting tombstones.
> So tombstones build up in the oldest segment forever.
>  
> While you could "fix" this by reducing the segment size, this can be 
> undesirable as a sudden change in throughput could cause a dangerous number 
> of segments to be created.
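As a practical note (not a fix for the root cause), the per-topic settings involved here can be adjusted with the Admin API. The sketch below is only illustrative; the bootstrap address, topic name, and values are placeholders, and whether raising the dirty ratio or rolling segments more aggressively actually helps depends on the low-throughput pattern described above.
{code:java}
import java.util.Arrays;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class TombstoneTopicConfigSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic =
                new ConfigResource(ConfigResource.Type.TOPIC, "my-compacted-topic");
            admin.incrementalAlterConfigs(Map.of(topic, Arrays.asList(
                // Time-based segment rolling; reducing segment size/age is the
                // workaround mentioned above (with the caveats it describes).
                new AlterConfigOp(new ConfigEntry("segment.ms", "86400000"),
                                  AlterConfigOp.OpType.SET),
                // How long tombstones are retained once their segment has been cleaned.
                new AlterConfigOp(new ConfigEntry("delete.retention.ms", "86400000"),
                                  AlterConfigOp.OpType.SET),
                // A dirty ratio near 0 is one of the preconditions described above.
                new AlterConfigOp(new ConfigEntry("min.cleanable.dirty.ratio", "0.5"),
                                  AlterConfigOp.OpType.SET)
            ))).all().get();
        }
    }
}
{code}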



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-10801) Docs on configuration have multiple places using the same HTML anchor tag

2020-12-03 Thread James Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-10801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243078#comment-17243078
 ] 

James Cheng commented on KAFKA-10801:
-

Thanks [~tombentley]!

 

I just looked it up. For future reference, this was fixed in 
https://github.com/apache/kafka/pull/8878

> Docs on configuration have multiple places using the same HTML anchor tag
> -
>
> Key: KAFKA-10801
> URL: https://issues.apache.org/jira/browse/KAFKA-10801
> Project: Kafka
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 2.6.0, 2.5.1
>Reporter: James Cheng
>Priority: Minor
> Fix For: 2.7.0
>
>
> The configuration option "compression.type" is a configuration option on the 
> Kafka Producer as well as on the Kafka brokers.
>  
> The same HTML anchor #compression.type is used on both of those entries. So 
> if you click or bookmark the link 
> [http://kafka.apache.org/documentation/#compression.type] , it will always 
> bring you to the first entry (the broker-side config). It will never bring 
> you to the 2nd entry (producer config).
>  
> I've only noticed this for the compression.type config, but it is possible 
> that it also applies to any other config option that is the same between the 
> broker/producer/consumer.
>  
> We should at least fix it for compression.type, and we should possibly fix it 
> across the entire document.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8522) Tombstones can survive forever

2020-12-02 Thread James Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242921#comment-17242921
 ] 

James Cheng commented on KAFKA-8522:


Hi, we are also experiencing this behavior. I was wondering what the status was 
on this JIRA. I saw on the dev mailing list that KIP-534 has been voted on and 
approved by the community. Thanks!

> Tombstones can survive forever
> --
>
> Key: KAFKA-8522
> URL: https://issues.apache.org/jira/browse/KAFKA-8522
> Project: Kafka
>  Issue Type: Improvement
>  Components: log cleaner
>Reporter: Evelyn Bayes
>Priority: Minor
>
> This is a bit of a grey zone as to whether it's a "bug", but it is certainly 
> unintended behaviour.
>  
> Under specific conditions tombstones effectively survive forever:
>  * Small amount of throughput;
>  * min.cleanable.dirty.ratio near or at 0; and
>  * Other parameters at default.
> What happens is that all the data continuously gets cycled into the oldest 
> segment. Old records get compacted away, but the new records continuously 
> update the timestamp of the oldest segment, resetting the countdown for 
> deleting tombstones.
> So tombstones build up in the oldest segment forever.
>  
> While you could "fix" this by reducing the segment size, this can be 
> undesirable as a sudden change in throughput could cause a dangerous number 
> of segments to be created.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-10801) Docs on configuration have multiple places using the same HTML anchor tag

2020-12-02 Thread James Cheng (Jira)
James Cheng created KAFKA-10801:
---

 Summary: Docs on configuration have multiple places using the same 
HTML anchor tag
 Key: KAFKA-10801
 URL: https://issues.apache.org/jira/browse/KAFKA-10801
 Project: Kafka
  Issue Type: Improvement
  Components: documentation
Affects Versions: 2.5.1, 2.6.0
Reporter: James Cheng


The configuration option "compression.type" is a configuration option on the 
Kafka Producer as well as on the Kafka brokers.

 

The same HTML anchor #compression.type is used on both of those entries. So if 
you click or bookmark the link 
[http://kafka.apache.org/documentation/#compression.type] , it will always 
bring you to the first entry (the broker-side config). It will never bring you 
to the 2nd entry (producer config).

 

I've only noticed this for the compression.type config, but it is possible that 
it also applies to any other config option that is the same between the 
broker/producer/consumer.

 

We should at least fix it for compression.type, and we should possibly fix it 
across the entire document.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-10509) Add metric to track throttle time due to hitting connection rate quota

2020-10-09 Thread James Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-10509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211484#comment-17211484
 ] 

James Cheng commented on KAFKA-10509:
-

Can we also update the documentation at 
[http://kafka.apache.org/documentation/#monitoring] to describe the new metric?

 

> Add metric to track throttle time due to hitting connection rate quota
> --
>
> Key: KAFKA-10509
> URL: https://issues.apache.org/jira/browse/KAFKA-10509
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Reporter: Anna Povzner
>Assignee: Anna Povzner
>Priority: Major
> Fix For: 2.7.0
>
>
> See KIP-612.
>  
> kafka.network:type=socket-server-metrics,name=connection-accept-throttle-time,listener={listenerName}
>  * Type: SampledStat.Avg
>  * Description: Average throttle time due to violating per-listener or 
> broker-wide connection acceptance rate quota on a given listener.
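For the docs it may also be worth showing how the quota itself gets applied, since this metric only becomes non-zero once the quota is violated. A minimal sketch using the dynamic broker config from KIP-612 (the broker id, bootstrap address, and rate value are placeholders; the exact per-listener key should be taken from the KIP itself):
{code:java}
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class ConnectionRateQuotaSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Dynamic config for broker id 0; the rate value is purely illustrative.
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "0");
            AlterConfigOp setRate = new AlterConfigOp(
                // Broker-wide connection creation rate limit introduced by KIP-612.
                // KIP-612 also describes a per-listener variant of this setting.
                new ConfigEntry("max.connection.creation.rate", "100"),
                AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                Map.of(broker, Collections.singletonList(setRate))
            ).all().get();
        }
    }
}
{code}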



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-7447) Consumer offsets lost during leadership rebalance after bringing node back from clean shutdown

2020-10-08 Thread James Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210470#comment-17210470
 ] 

James Cheng commented on KAFKA-7447:


Hi. The consensus seems to be that this was fixed in 
https://issues.apache.org/jira/browse/KAFKA-8896 

Should we update the Status, Resolution, and Fixed fields to correspond to 
https://issues.apache.org/jira/browse/KAFKA-8896 ?

 

> Consumer offsets lost during leadership rebalance after bringing node back 
> from clean shutdown
> --
>
> Key: KAFKA-7447
> URL: https://issues.apache.org/jira/browse/KAFKA-7447
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 1.1.1, 2.0.0
>Reporter: Ben Isaacs
>Priority: Major
>
> *Summary:*
>  * When 1 of my 3 brokers is cleanly shut down, consumption and production 
> continues as normal due to replication. (Consumers are rebalanced to the 
> replicas, and producers are rebalanced to the remaining brokers). However, 
> when the cleanly-shut-down broker comes back, after about 10 minutes, a 
> flurry of production errors occur and my consumers suddenly go back in time 2 
> weeks, causing a long outage (12 hours+) as all messages are replayed on some 
> topics.
>  * The hypothesis is that the auto-leadership-rebalance is happening too 
> quickly after the downed broker returns, before it has had a chance to become 
> fully synchronised on all partitions. In particular, it seems that having 
> consumer offsets ahead of the most recent data on the topic that consumer was 
> following causes the consumer to be reset to 0.
> *Expected:*
>  * bringing a node back from a clean shut down does not cause any consumers 
> to reset to 0.
> *Actual:*
>  * I experience approximately 12 hours of partial outage triggered at the 
> point that auto leadership rebalance occurs, after a cleanly shut down node 
> returns.
> *Workaround:*
>  * disable auto leadership rebalance entirely. 
>  * manually rebalance it from time to time when all nodes and all partitions 
> are fully replicated.
> *My Setup:*
>  * Kafka deployment with 3 brokers and 2 topics.
>  * Replication factor is 3, for all topics.
>  * min.isr is 2, for all topics.
>  * Zookeeper deployment with 3 instances.
>  * In the region of 10 to 15 consumers, with 2 user topics (and, of course, 
> the system topics such as consumer offsets). Consumer offsets has the 
> standard 50 partitions. The user topics have about 3000 partitions in total.
>  * Offset retention time of 7 days, and topic retention time of 14 days.
>  * Input rate ~1000 messages/sec.
>  * Deployment happens to be on Google compute engine.
> *Related Stack Overflow Post:*
> https://stackoverflow.com/questions/52367825/apache-kafka-loses-some-consumer-offsets-when-when-i-bounce-a-broker
> It was suggested I open a ticket by "Muir", who says they have also 
> experienced this.
> *Transcription of logs, showing the problem:*
> Below, you can see chronologically sorted, interleaved, logs from the 3 
> brokers. prod-kafka-2 is the node which was cleanly shut down and then 
> restarted. I filtered the messages only to those regarding 
> __consumer_offsets-29 because it's just too much to paste otherwise.
> ||Broker host||Broker ID||
> |prod-kafka-1|0|
> |prod-kafka-2|1 (this one was restarted)|
> |prod-kafka-3|2|
> prod-kafka-2: (just starting up)
> {code}
> [2018-09-17 09:21:46,246] WARN [ReplicaFetcher replicaId=1, leaderId=2, 
> fetcherId=0] Based on follower's leader epoch, leader replied with an unknown 
> offset in __consumer_offsets-29. The initial fetch offset 0 will be used for 
> truncation. (kafka.server.ReplicaFetcherThread)
> {code}
> prod-kafka-3: (sees replica1 come back)
> {code}
> [2018-09-17 09:22:02,027] INFO [Partition __consumer_offsets-29 broker=2] 
> Expanding ISR from 0,2 to 0,2,1 (kafka.cluster.Partition)
> {code}
> prod-kafka-2:
> {code}
> [2018-09-17 09:22:33,892] INFO [GroupMetadataManager brokerId=1] Scheduling 
> unloading of offsets and group metadata from __consumer_offsets-29 
> (kafka.coordinator.group.GroupMetadataManager)
>  [2018-09-17 09:22:33,902] INFO [GroupMetadataManager brokerId=1] Finished 
> unloading __consumer_offsets-29. Removed 0 cached offsets and 0 cached 
> groups. (kafka.coordinator.group.GroupMetadataManager)
>  [2018-09-17 09:24:03,287] INFO [ReplicaFetcherManager on broker 1] Removed 
> fetcher for partitions __consumer_offsets-29 
> (kafka.server.ReplicaFetcherManager)
>  [2018-09-17 09:24:03,287] INFO [Partition __consumer_offsets-29 broker=1] 
> __consumer_offsets-29 starts at Leader Epoch 78 from offset 0. Previous 
> Leader Epoch was: 77 (kafka.cluster.Partition)
>  [2018-09-17 09:24:03,287] INFO [GroupMetadataManager brokerId=1] Scheduling 
> loading of offsets and group metadata from 

[jira] [Assigned] (KAFKA-8653) Regression in JoinGroup v0 rebalance timeout handling

2020-09-22 Thread James Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Cheng reassigned KAFKA-8653:
--

Assignee: James Cheng  (was: Jason Gustafson)

> Regression in JoinGroup v0 rebalance timeout handling
> -
>
> Key: KAFKA-8653
> URL: https://issues.apache.org/jira/browse/KAFKA-8653
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.3.0
>Reporter: Jason Gustafson
>Assignee: James Cheng
>Priority: Blocker
> Fix For: 2.3.1
>
>
> The rebalance timeout was added to the JoinGroup protocol in version 1. Prior 
> to 2.3, we handled version 0 JoinGroup requests by setting the rebalance 
> timeout to be equal to the session timeout. We lost this logic when we 
> converted the API to use the generated schema definition which uses the 
> default value of -1. The impact of this is that the group rebalance timeout 
> becomes 0, so rebalances finish immediately after we enter the 
> PrepareRebalance state and kick out all old members. This causes consumer 
> groups to enter an endless rebalance loop.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-8653) Regression in JoinGroup v0 rebalance timeout handling

2020-09-22 Thread James Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Cheng reassigned KAFKA-8653:
--

Assignee: (was: James Cheng)

> Regression in JoinGroup v0 rebalance timeout handling
> -
>
> Key: KAFKA-8653
> URL: https://issues.apache.org/jira/browse/KAFKA-8653
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.3.0
>Reporter: Jason Gustafson
>Priority: Blocker
> Fix For: 2.3.1
>
>
> The rebalance timeout was added to the JoinGroup protocol in version 1. Prior 
> to 2.3, we handled version 0 JoinGroup requests by setting the rebalance 
> timeout to be equal to the session timeout. We lost this logic when we 
> converted the API to use the generated schema definition which uses the 
> default value of -1. The impact of this is that the group rebalance timeout 
> becomes 0, so rebalances finish immediately after we enter the 
> PrepareRebalance state and kick out all old members. This causes consumer 
> groups to enter an endless rebalance loop.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-10473) Website is missing docs on JMX metrics for partition size-on-disk (kafka.log:type=Log,name=*)

2020-09-09 Thread James Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Cheng reassigned KAFKA-10473:
---

Assignee: James Cheng

> Website is missing docs on JMX metrics for partition size-on-disk 
> (kafka.log:type=Log,name=*)
> -
>
> Key: KAFKA-10473
> URL: https://issues.apache.org/jira/browse/KAFKA-10473
> Project: Kafka
>  Issue Type: Improvement
>  Components: docs, documentation
>Affects Versions: 2.6.0, 2.5.1
>Reporter: James Cheng
>Assignee: James Cheng
>Priority: Minor
>
> The website is missing docs on the following JMX metrics:
> kafka.log:type=Log,name=Size
> kafka.log:type=Log,name=NumLogSegments
> kafka.log:type=Log,name=LogStartOffset
> kafka.log:type=Log,name=LogEndOffset
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-10473) Website is missing docs on JMX metrics for partition size-on-disk (kafka.log:type=Log,name=*)

2020-09-09 Thread James Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Cheng updated KAFKA-10473:

Summary: Website is missing docs on JMX metrics for partition size-on-disk 
(kafka.log:type=Log,name=*)  (was: Website is missing docs on JMX metrics for 
partition size-on-disk)

> Website is missing docs on JMX metrics for partition size-on-disk 
> (kafka.log:type=Log,name=*)
> -
>
> Key: KAFKA-10473
> URL: https://issues.apache.org/jira/browse/KAFKA-10473
> Project: Kafka
>  Issue Type: Improvement
>  Components: docs, documentation
>Affects Versions: 2.6.0, 2.5.1
>Reporter: James Cheng
>Priority: Minor
>
> The website is missing docs on the following JMX metrics:
> kafka.log:type=Log,name=Size
> kafka.log:type=Log,name=NumLogSegments
> kafka.log:type=Log,name=LogStartOffset
> kafka.log:type=Log,name=LogEndOffset
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-10473) Website is missing docs on JMX metrics for partition size-on-disk

2020-09-09 Thread James Cheng (Jira)
James Cheng created KAFKA-10473:
---

 Summary: Website is missing docs on JMX metrics for partition 
size-on-disk
 Key: KAFKA-10473
 URL: https://issues.apache.org/jira/browse/KAFKA-10473
 Project: Kafka
  Issue Type: Improvement
  Components: docs, documentation
Affects Versions: 2.5.1, 2.6.0
Reporter: James Cheng


The website is missing docs on the following JMX metrics:

kafka.log:type=Log,name=Size

kafka.log:type=Log,name=NumLogSegments

kafka.log:type=Log,name=LogStartOffset

kafka.log:type=Log,name=LogEndOffset

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8764) LogCleanerManager endless loop while compacting/cleaning segments

2020-09-02 Thread James Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189628#comment-17189628
 ] 

James Cheng commented on KAFKA-8764:


[~junrao], in the comment 
https://issues.apache.org/jira/browse/KAFKA-8764?focusedCommentId=17013081=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17013081
 above, you said
{quote}Currently, the log cleaner runs independently on each replica. If a 
follower has been down for some time, it is possible that the leader has 
already cleaned the data and left holes in some log segments. When those 
segments get replicated to the follower, the follower will clean the same data 
again and potentially hit the above issue.
{quote}
 

Would this also happen if you added a new follower? For example, if I have a 
compacted topic on Brokers 1 2 3 where compaction has already happened, and 
then I use kafka-reassign-partitions.sh to move it to brokers 4 5 6, it sounds like 
compaction will happen independently on brokers 4 5 6, and will run into this 
issue.

> LogCleanerManager endless loop while compacting/cleaning segments
> -
>
> Key: KAFKA-8764
> URL: https://issues.apache.org/jira/browse/KAFKA-8764
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Affects Versions: 2.3.0, 2.2.1, 2.4.0
> Environment: docker base image: openjdk:8-jre-alpine base image, 
> kafka from http://ftp.carnet.hr/misc/apache/kafka/2.2.1/kafka_2.12-2.2.1.tgz
>Reporter: Tomislav Rajakovic
>Priority: Major
>  Labels: patch
> Fix For: 2.2.3, 2.5.0, 2.3.2, 2.4.1
>
> Attachments: Screen Shot 2020-01-10 at 8.38.25 AM.png, 
> kafka2.4.0-KAFKA-8764.patch, kafka2.4.0-KAFKA-8764.patch, 
> log-cleaner-bug-reproduction.zip
>
>
> LogCleanerManager is stuck in an endless loop while cleaning segments for one 
> partition, resulting in many log outputs and heavy disk reads/writes/IOPS.
>  
> Issue appeared on follower brokers, and it happens on every (new) broker if 
> partition assignment is changed.
>  
> Original issue setup:
>  * kafka_2.12-2.2.1 deployed as statefulset on kubernetes, 5 brokers
>  * log directory is (AWS) EBS mounted PV, gp2 (ssd) kind of 750GiB
>  * 5 zookeepers
>  * topic created with config:
>  ** name = "backup_br_domain_squad"
> partitions = 36
> replication_factor = 3
> config = {
>  "cleanup.policy" = "compact"
>  "min.compaction.lag.ms" = "8640"
>  "min.cleanable.dirty.ratio" = "0.3"
> }
>  
>  
> Log excerpt:
> {{[2019-08-07 12:10:53,895] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,895] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,896] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:53,964] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,031] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted offset index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.index.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,032] INFO Deleted time index 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.timeindex.deleted.
>  (kafka.log.LogSegment)}}
> {{[2019-08-07 12:10:54,101] INFO [Log partition=backup_br_domain_squad-14, 
> dir=/var/lib/kafka/data/topics] Deleting segment 0 (kafka.log.Log)}}
> {{[2019-08-07 12:10:54,101] INFO Deleted log 
> /var/lib/kafka/data/topics/backup_br_domain_squad-14/.log.deleted.
>  

[jira] [Commented] (KAFKA-10105) Regression in group coordinator dealing with flaky clients joining while leaving

2020-06-13 Thread James Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134694#comment-17134694
 ] 

James Cheng commented on KAFKA-10105:
-

Would any of you be able to provide a repro case? Or steps to reproduce?

> Regression in group coordinator dealing with flaky clients joining while 
> leaving
> 
>
> Key: KAFKA-10105
> URL: https://issues.apache.org/jira/browse/KAFKA-10105
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.4.1
> Environment: Kafka 1.1.0 on jre 8 on debian 9 in docker
> Kafka 2.4.1 on jre 11 on debian 9 in docker
>Reporter: William Reynolds
>Priority: Major
>
> Since upgrading a cluster from 1.1.0 to 2.4.1, the broker no longer deals 
> correctly with a consumer sending a join after a leave.
> What happens now is that if a consumer sends a leave and then follows up by 
> trying to send a join again as it is shutting down, the group coordinator adds 
> the leaving member to the group but never seems to heartbeat that member.
> Since the consumer is then gone, when it joins again after starting it is 
> added as a new member, but the zombie member is still there and is included 
> in the partition assignment, which means that those partitions never get 
> consumed from. What can also happen is that one of the zombies becomes group 
> leader, so the rebalance gets stuck forever and the group is entirely blocked.
> I have not been able to track down where this got introduced between 1.1.0 
> and 2.4.1, but I will look further into this. Unfortunately the logs are 
> essentially silent about the zombie members, and I only had INFO level logging 
> on during the issue; by stopping all the consumers in the group and 
> restarting the broker coordinating that group we could get back to a working 
> state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-10105) Regression in group coordinator dealing with flaky clients joining while leaving

2020-06-08 Thread James Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128005#comment-17128005
 ] 

James Cheng commented on KAFKA-10105:
-

Is this related to https://issues.apache.org/jira/browse/KAFKA-9935 ?

> Regression in group coordinator dealing with flaky clients joining while 
> leaving
> 
>
> Key: KAFKA-10105
> URL: https://issues.apache.org/jira/browse/KAFKA-10105
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.4.1
> Environment: Kafka 1.1.0 on jre 8 on debian 9 in docker
> Kafka 2.4.1 on jre 11 on debian 9 in docker
>Reporter: William Reynolds
>Priority: Major
>
> Since upgrading a cluster from 1.1.0 to 2.4.1, the broker no longer deals 
> correctly with a consumer sending a join after a leave.
> What happens now is that if a consumer sends a leave and then follows up by 
> trying to send a join again as it is shutting down, the group coordinator adds 
> the leaving member to the group but never seems to heartbeat that member.
> Since the consumer is then gone, when it joins again after starting it is 
> added as a new member, but the zombie member is still there and is included 
> in the partition assignment, which means that those partitions never get 
> consumed from. What can also happen is that one of the zombies becomes group 
> leader, so the rebalance gets stuck forever and the group is entirely blocked.
> I have not been able to track down where this got introduced between 1.1.0 
> and 2.4.1, but I will look further into this. Unfortunately the logs are 
> essentially silent about the zombie members, and I only had INFO level logging 
> on during the issue; by stopping all the consumers in the group and 
> restarting the broker coordinating that group we could get back to a working 
> state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9935) Kafka not releasing member from Consumer Group

2020-06-08 Thread James Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128007#comment-17128007
 ] 

James Cheng commented on KAFKA-9935:


Is this related to https://issues.apache.org/jira/browse/KAFKA-10105 ?

> Kafka not releasing member from Consumer Group
> --
>
> Key: KAFKA-9935
> URL: https://issues.apache.org/jira/browse/KAFKA-9935
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.4.1
> Environment: Linux
>Reporter: Steve Kecskes
>Priority: Major
>
> Hello. I am experiencing an issue where Kafka is not releasing members from a 
> consumer group when the member crashes. The consumer group is then stuck in 
> rebalancing state indefinitely.
> In this consumer group, there is only 1 client. This client has the following 
> related settings:
> {code:java}
> auto.commit.interval.ms = 5000
>  auto.offset.reset = earliest
>  bootstrap.servers = [austgkafka01.hk.eclipseoptions.com:9092]
>  check.crcs = true
>  client.dns.lookup = default
>  client.id = TraderAutomationViewServer_workaround_stuck_rebalance_20200427-0
>  connections.max.idle.ms = 54
>  default.api.timeout.ms = 6
>  enable.auto.commit = true
>  exclude.internal.topics = true
>  fetch.max.bytes = 52428800
>  fetch.max.wait.ms = 500
>  fetch.min.bytes = 1
>  group.id = TraderAutomationViewServer_workaround_stuck_rebalance_20200427
>  heartbeat.interval.ms = 3000
>  interceptor.classes = []
>  internal.leave.group.on.close = true
>  isolation.level = read_uncommitted
>  key.deserializer = class 
> org.apache.kafka.common.serialization.ByteArrayDeserializer
>  max.partition.fetch.bytes = 1048576
>  max.poll.interval.ms = 30
>  max.poll.records = 1
>  metadata.max.age.ms = 30
>  metric.reporters = []
>  metrics.num.samples = 2
>  metrics.recording.level = INFO
>  metrics.sample.window.ms = 3
>  partition.assignment.strategy = [class 
> org.apache.kafka.clients.consumer.RangeAssignor]
>  receive.buffer.bytes = 16777216
>  reconnect.backoff.max.ms = 1000
>  reconnect.backoff.ms = 50
>  request.timeout.ms = 3
>  retry.backoff.ms = 100
>  sasl.client.callback.handler.class = null
>  sasl.jaas.config = null
>  sasl.kerberos.kinit.cmd = /usr/bin/kinit
>  sasl.kerberos.min.time.before.relogin = 6
>  sasl.kerberos.service.name = null
>  sasl.kerberos.ticket.renew.jitter = 0.05
>  sasl.kerberos.ticket.renew.window.factor = 0.8
>  sasl.login.callback.handler.class = null
>  sasl.login.class = null
>  sasl.login.refresh.buffer.seconds = 300
>  sasl.login.refresh.min.period.seconds = 60
>  sasl.login.refresh.window.factor = 0.8
>  sasl.login.refresh.window.jitter = 0.05
>  sasl.mechanism = GSSAPI
>  security.protocol = PLAINTEXT
>  send.buffer.bytes = 131072
>  session.timeout.ms = 1
>  ssl.cipher.suites = null
>  ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
>  ssl.endpoint.identification.algorithm = https
>  ssl.key.password = null
>  ssl.keymanager.algorithm = SunX509
>  ssl.keystore.location = null
>  ssl.keystore.password = null
>  ssl.keystore.type = JKS
>  ssl.protocol = TLS
>  ssl.provider = null
>  ssl.secure.random.implementation = null
>  ssl.trustmanager.algorithm = PKIX
>  ssl.truststore.location = null
>  ssl.truststore.password = null
>  ssl.truststore.type = JKS
>  value.deserializer = class 
> org.apache.kafka.common.serialization.ByteArrayDeserializer
> {code}
> If the client crashes (not a graceful exit from group) the group remains in 
> the following state indefinitely.
> {code}
> Warning: Consumer group 
> 'TraderAutomationViewServer_workaround_stuck_rebalance' is rebalancing.
> GROUP TOPIC 
> PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG CONSUMER-ID 
> HOSTCLIENT-ID
> TraderAutomationViewServer_workaround_stuck_rebalance EventAdjustedVols 10
>  6984061 7839599 855538  -   -
>-
> TraderAutomationViewServer_workaround_stuck_rebalance VolMetrics8 
>  128459531   143736443   15276912-   -
>-
> TraderAutomationViewServer_workaround_stuck_rebalance EventAdjustedVols 12
>  7216495 8106030 889535  -   -
>-
> TraderAutomationViewServer_workaround_stuck_rebalance VolMetrics6 
>  122921729   137377358   14455629-   -
>-
> TraderAutomationViewServer_workaround_stuck_rebalance EventAdjustedVols 14
>  5457618 6171142 713524  -   -
>-
> TraderAutomationViewServer_workaround_stuck_rebalance VolMetrics4 
>  125647891   140542566   14894675-   -
>   

[jira] [Commented] (KAFKA-8180) Deleting large number of topics can block the controller for the time it takes to delete all of them

2020-06-06 Thread James Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127196#comment-17127196
 ] 

James Cheng commented on KAFKA-8180:


Do we know if this has been fixed in recent versions? I recall hearing about 
some improvements to topic deletion, but I'm not sure if they fixed this or not.

> Deleting large number of topics can block the controller for the time it 
> takes to delete all of them
> 
>
> Key: KAFKA-8180
> URL: https://issues.apache.org/jira/browse/KAFKA-8180
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Reporter: Gwen Shapira
>Priority: Major
>
> Scenario:
> - Create large number of topics (In my experiment: 400 topics with 12 
> partitions each )
> - Use the admin client to delete all of them in a single batch operation
> - Try to bounce another broker while this is going on
> As you can see from the logs and metrics - topic deletion happens 
> synchronously in the controller and it does not do anything else (leader 
> elections for instance) while it is busy deleting (which can take many 
> minutes for large batches).
> I recommend fixing it by throttling the deletes - no matter how large a batch 
> the client sent, the controller should delete a subset and complete a full 
> cycle before deleting the next subset.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9697) ControlPlaneNetworkProcessorAvgIdlePercent is always NaN

2020-04-06 Thread James Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076763#comment-17076763
 ] 

James Cheng commented on KAFKA-9697:


Oh, no, we don't have that setup. I had assumed that the 
ControlPlaneNetworkProcessorAvgIdlePercent metric would be present, even if 
both the control plane and the data plane shared the same listeners. But 
apparently that is not the case?

Is this purely a documentation issue then? Should we add text saying that the 
ControlPlaneNetworkProcessorAvgIdlePercent metric will be NaN if 
control.plane.listener.name is unset? (Related, I noticed that 
ControlPlaneNetworkProcessorAvgIdlePercent isn't documented on the website)

Or, should we programmatically hide the metric if control.plane.listener.name 
is not yet set?

Or, can we still calculate a value of 
ControlPlaneNetworkProcessorAvgIdlePercent when the control plane and data 
plane share a listener?

 

 

> ControlPlaneNetworkProcessorAvgIdlePercent is always NaN
> 
>
> Key: KAFKA-9697
> URL: https://issues.apache.org/jira/browse/KAFKA-9697
> Project: Kafka
>  Issue Type: Improvement
>  Components: metrics
>Affects Versions: 2.3.0
>Reporter: James Cheng
>Priority: Major
>
> I have a broker running Kafka 2.3.0. The value of 
> kafka.network:type=SocketServer,name=ControlPlaneNetworkProcessorAvgIdlePercent
>  is always "NaN".
> Is that normal, or is there a problem with the metric?
> I am running Kafka 2.3.0. I have not checked this in newer/older versions.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9389) Document how to use kafka-reassign-partitions.sh to change log dirs for a partition

2020-03-10 Thread James Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1705#comment-1705
 ] 

James Cheng commented on KAFKA-9389:


[~mitchellh], I think in order for Jira to auto-link your PR, the title of your 
PR must start with "KAFKA-9389 "

> Document how to use kafka-reassign-partitions.sh to change log dirs for a 
> partition
> ---
>
> Key: KAFKA-9389
> URL: https://issues.apache.org/jira/browse/KAFKA-9389
> Project: Kafka
>  Issue Type: Improvement
>Reporter: James Cheng
>Assignee: Mitchell
>Priority: Major
>  Labels: newbie
>
> KIP-113 introduced support for moving replicas between log directories. As 
> part of it, support was added to kafka-reassign-partitions.sh so that users 
> can move replicas between log directories. Specifically, when you call 
> "kafka-reassign-partitions.sh --topics-to-move-json-file 
> topics-to-move.json", you can specify a "log_dirs" key in the 
> topics-to-move.json file, and kafka-reassign-partitions.sh will then move 
> those replicas to those directories.
>  
> However, when working on that KIP, we didn't update the docs on 
> kafka.apache.org to describe how to use the new functionality. We should add 
> documentation on that.
>  
> I haven't used it before, but whoever works on this Jira can probably figure 
> it out by experimentation with kafka-reassign-partitions.sh, or by reading 
> KIP-113 page or the associated JIRAs.
>  * 
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories]
>  * KAFKA-5163
>  * KAFKA-5694
>  
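For whoever documents this: besides the log_dirs field in the reassignment JSON, KIP-113 also exposed the same capability through the Admin API, which may be worth cross-referencing. A minimal sketch (the bootstrap address, topic, broker id, and target directory are placeholders, not values from this ticket):
{code:java}
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.common.TopicPartitionReplica;

public class MoveReplicaLogDirSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Move the replica of my-topic-0 that lives on broker 1 to another
            // log directory on that same broker.
            TopicPartitionReplica replica = new TopicPartitionReplica("my-topic", 0, 1);
            admin.alterReplicaLogDirs(
                Map.of(replica, "/data/disk2/kafka-logs")
            ).all().get();
        }
    }
}
{code}
My understanding is that in the JSON form, log_dirs must contain one entry per replica, each either an absolute path on the target broker or "any"; that is exactly the kind of detail the new docs should pin down.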



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9697) ControlPlaneNetworkProcessorAvgIdlePercent is always NaN

2020-03-10 Thread James Cheng (Jira)
James Cheng created KAFKA-9697:
--

 Summary: ControlPlaneNetworkProcessorAvgIdlePercent is always NaN
 Key: KAFKA-9697
 URL: https://issues.apache.org/jira/browse/KAFKA-9697
 Project: Kafka
  Issue Type: Improvement
  Components: metrics
Affects Versions: 2.3.0
Reporter: James Cheng


I have a broker running Kafka 2.3.0. The value of 
kafka.network:type=SocketServer,name=ControlPlaneNetworkProcessorAvgIdlePercent 
is always "NaN".

Is that normal, or is there a problem with the metric?

I am running Kafka 2.3.0. I have not checked this in newer/older versions.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9696) Document the control plane metrics that were added in KIP-402

2020-03-10 Thread James Cheng (Jira)
James Cheng created KAFKA-9696:
--

 Summary: Document the control plane metrics that were added in 
KIP-402
 Key: KAFKA-9696
 URL: https://issues.apache.org/jira/browse/KAFKA-9696
 Project: Kafka
  Issue Type: Improvement
  Components: documentation
Affects Versions: 2.4.0, 2.3.0, 2.2.0
Reporter: James Cheng


KIP-402 (in https://issues.apache.org/jira/browse/KAFKA-7719) added new metrics 
of

 

kafka.network:type=SocketServer,name=ControlPlaneNetworkProcessorAvgIdlePercent

kafka.network:type=SocketServer,name=ControlPlaneExpiredConnectionsKilledCount

 

There is no documentation on these metrics on 
http://kafka.apache.org/documentation/. We should update the documentation to 
describe these new metrics.

 

I'm not 100% familiar with them, but it appears they are measuring the same 
thing as 

kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent

kafka.network:type=SocketServer,name=ExpiredConnectionsKilledCount

 

except for the control plane, instead of the data plane.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions

2020-02-21 Thread James Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042297#comment-17042297
 ] 

James Cheng commented on KAFKA-4084:


Hi,
 
Just wanted to chime in with some of our experience on this front.
 
We've also encountered issues like this. We run our brokers with roughly 10k 
partitions per broker.
 
We have auto.leader.rebalance.enable=true. It sometimes happens to us that a 
normal restart of a broker will cause our cluster to get into a bad state for 
1-2 hours.
 
During that time, what we've seen is:
 * Auto-leader rebalance triggers and moves a large number of leaders to/from 
the various brokers.
 * The broker that is receiving the leaders takes a long time to assume 
leadership. If you do a metadata request to it, it will say it is not the 
leader for those partitions.
 * The brokers that are giving up leadership will quickly give up leadership. 
If you do a metadata request to them, they will say that they are not the 
leader for those partitions.
 * Clients will receive notification that leadership has transitioned. But 
since the clients will get their metadata from an arbitrary broker in the 
cluster, they will receive different metadata depending on which broker they 
contact. Regardless, they will trust that metadata and will attempt to fetch 
from the broker that they believe is the leader. But at this point, all 
brokers are claiming that they are not the leader for those partitions. So the 
clients will attempt a fetch, get an error, refetch metadata, and try again. 
You get a thundering herd problem, where all clients are hammering the cluster.
 * Broker request queues will be busy attempting to assume leadership and do 
replication, as well as answering and rejecting all of these requests.
 * During this time, each broker will be reporting UnderReplicatedPartitions=0. 
The reason is that a broker only reports a partition as under-replicated if it 
is the leader for that partition and some followers are not present. In this 
case, the brokers do not believe they are leaders for these partitions, and so 
will not report them as under-replicated. Similarly, I believe that 
OfflinePartitions=0 as well. But from a user's point of view, these partitions 
are effectively inaccessible, because no one will serve traffic for them.
 
The metrics we've seen during this are:
 * ControllerQueueSize goes to maximum
 * RequestHandlerIdlePercent drops to 0%
 * (maybe) RequestQueueSize goes to maximum
 * TotalTimeMs for Produce requests goes into the 2ms range
 * TotalTimeMs for FetchFollower goes into the 1000ms range (which is odd, 
because the default setting is 500ms)
 * TotalTimeMs for FetchConsumer goes into the 1000ms range (which is odd, 
because the default setting is 500ms)
 
We've also seen similar behavior when we replace the hard drive on a broker 
and it has to re-replicate its entire contents. We don't yet definitively 
understand that one.
 
We've worked with Confluent on this. The advice from Confluent seems to be 
that our brokers are simply underpowered for our situation. From the metrics 
point of view, the evidence that points to this is:
 * Normal host CPU utilization is 45-60%
 * During these times, high CPU utilization of 60-70%
 * High ResponseSendTimeMs during these times
 * NetworkProcessorIdlePercent 0%
 
From our side, we're not sure *what* specifically is the thing that we're 
underpowered *for*. Is it the number of partitions? Number of clients + number 
of retries? Amount of network data? All of the above? We're not sure.
 
One idea on how to mitigate it is to set
{code:java}
auto.leader.rebalance.enable=false{code}
and then use kafka-reassign-partitions.sh to move leaders just a few at a 
time, until leadership is rebalanced.
 
You can do something similar when re-populating a broker from scratch. You can 
remove the (empty) broker from all partitions, and then use 
kafka-reassign-partitions.sh to add the broker as a follower to a small number 
of partitions at a time. That lets you control how many partitions move at a 
time. kafka-reassign-partitions.sh also lets you specify throttling, so you 
can also use that to limit network bandwidth. So at this point, you are in 
control of how *many* partitions are moving, as well as how much *network* 
they use. And then lastly, you can also control not just "become 
follower" but you can also decide 

[jira] [Commented] (KAFKA-9389) Document how to use kafka-reassign-partitions.sh to change log dirs for a partition

2020-01-28 Thread James Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025451#comment-17025451
 ] 

James Cheng commented on KAFKA-9389:


[~mitchellh], I noticed you closed this pull request. What's your status? Are 
you waiting for a code review approval at this point?

> Document how to use kafka-reassign-partitions.sh to change log dirs for a 
> partition
> ---
>
> Key: KAFKA-9389
> URL: https://issues.apache.org/jira/browse/KAFKA-9389
> Project: Kafka
>  Issue Type: Improvement
>Reporter: James Cheng
>Assignee: Mitchell
>Priority: Major
>  Labels: newbie
>
> KIP-113 introduced support for moving replicas between log directories. As 
> part of it, support was added to kafka-reassign-partitions.sh so that users 
> can move replicas between log directories. Specifically, when you call 
> "kafka-reassign-partitions.sh --topics-to-move-json-file 
> topics-to-move.json", you can specify a "log_dirs" key in the 
> topics-to-move.json file, and kafka-reassign-partitions.sh will then move 
> those replicas to those directories.
>  
> However, when working on that KIP, we didn't update the docs on 
> kafka.apache.org to describe how to use the new functionality. We should add 
> documentation on that.
>  
> I haven't used it before, but whoever works on this Jira can probably figure 
> it out by experimentation with kafka-reassign-partitions.sh, or by reading 
> KIP-113 page or the associated JIRAs.
>  * 
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories]
>  * KAFKA-5163
>  * KAFKA-5694
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9389) Document how to use kafka-reassign-partitions.sh to change log dirs for a partition

2020-01-08 Thread James Cheng (Jira)
James Cheng created KAFKA-9389:
--

 Summary: Document how to use kafka-reassign-partitions.sh to 
change log dirs for a partition
 Key: KAFKA-9389
 URL: https://issues.apache.org/jira/browse/KAFKA-9389
 Project: Kafka
  Issue Type: Improvement
Reporter: James Cheng


KIP-113 introduced support for moving replicas between log directories. As part 
of it, support was added to kafka-reassign-partitions.sh so that users can move 
replicas between log directories. Specifically, when you call 
"kafka-reassign-partitions.sh --topics-to-move-json-file topics-to-move.json", 
you can specify a "log_dirs" key in the topics-to-move.json file, and 
kafka-reassign-partitions.sh will then move those replicas to those directories.

 

However, when working on that KIP, we didn't update the docs on 
kafka.apache.org to describe how to use the new functionality. We should add 
documentation on that.

 

I haven't used it before, but whoever works on this Jira can probably figure it 
out by experimentation with kafka-reassign-partitions.sh, or by reading KIP-113 
page or the associated JIRAs.
 * 
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories]
 * KAFKA-5163
 * KAFKA-5694

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9133) LogCleaner thread dies with: currentLog cannot be empty on an unexpected exception

2019-11-11 Thread James Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971825#comment-16971825
 ] 

James Cheng commented on KAFKA-9133:


Will there be a 2.3.2 release?

 

> LogCleaner thread dies with: currentLog cannot be empty on an unexpected 
> exception
> --
>
> Key: KAFKA-9133
> URL: https://issues.apache.org/jira/browse/KAFKA-9133
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Affects Versions: 2.3.1
>Reporter: Karolis Pocius
>Assignee: Jason Gustafson
>Priority: Blocker
> Fix For: 2.4.0, 2.3.2
>
>
> Log cleaner thread dies without a clear reference to which log is causing it:
> {code}
> [2019-11-02 11:59:59,078] INFO Starting the log cleaner (kafka.log.LogCleaner)
> [2019-11-02 11:59:59,144] INFO [kafka-log-cleaner-thread-0]: Starting 
> (kafka.log.LogCleaner)
> [2019-11-02 11:59:59,199] ERROR [kafka-log-cleaner-thread-0]: Error due to 
> (kafka.log.LogCleaner)
> java.lang.IllegalStateException: currentLog cannot be empty on an unexpected 
> exception
>  at kafka.log.LogCleaner$CleanerThread.cleanFilthiestLog(LogCleaner.scala:346)
>  at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:307)
>  at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:89)
> Caused by: java.lang.IllegalArgumentException: Illegal request for non-active 
> segments beginning at offset 5033130, which is larger than the active 
> segment's base offset 5019648
>  at kafka.log.Log.nonActiveLogSegmentsFrom(Log.scala:1933)
>  at 
> kafka.log.LogCleanerManager$.maxCompactionDelay(LogCleanerManager.scala:491)
>  at 
> kafka.log.LogCleanerManager.$anonfun$grabFilthiestCompactedLog$4(LogCleanerManager.scala:184)
>  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>  at scala.collection.immutable.List.foreach(List.scala:392)
>  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>  at scala.collection.immutable.List.map(List.scala:298)
>  at 
> kafka.log.LogCleanerManager.$anonfun$grabFilthiestCompactedLog$1(LogCleanerManager.scala:181)
>  at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:253)
>  at 
> kafka.log.LogCleanerManager.grabFilthiestCompactedLog(LogCleanerManager.scala:171)
>  at kafka.log.LogCleaner$CleanerThread.cleanFilthiestLog(LogCleaner.scala:321)
>  ... 2 more
> [2019-11-02 11:59:59,200] INFO [kafka-log-cleaner-thread-0]: Stopped 
> (kafka.log.LogCleaner)
> {code}
> If I try to ressurect it by dynamically bumping {{log.cleaner.threads}} it 
> instantly dies with the exact same error.
> Not sure if this is something KAFKA-8725 is supposed to address.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-8745) DumpLogSegments doesn't show keys, when the message is null

2019-09-18 Thread James Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Cheng resolved KAFKA-8745.

Fix Version/s: 2.4.0
 Reviewer: Guozhang Wang
   Resolution: Fixed

> DumpLogSegments doesn't show keys, when the message is null
> ---
>
> Key: KAFKA-8745
> URL: https://issues.apache.org/jira/browse/KAFKA-8745
> Project: Kafka
>  Issue Type: Improvement
>  Components: tools
>Affects Versions: 2.3.0
>Reporter: James Cheng
>Assignee: James Cheng
>Priority: Major
> Fix For: 2.4.0
>
>
> When DumpLogSegments encounters a message with a message key, but no message 
> value, it doesn't print out the message key.
>  
> {noformat}
> $ ~/kafka_2.11-2.2.0/bin/kafka-run-class.sh kafka.tools.DumpLogSegments 
> --files compacted-0/.log --print-data-log
> Dumping compacted-0/.log
> Starting offset: 0
> baseOffset: 0 lastOffset: 3 count: 4 baseSequence: -1 lastSequence: -1 
> producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: 
> false isControl: false position: 0 CreateTime: 1564696640073 size: 113 magic: 
> 2 compresscodec: NONE crc: 206507478 isvalid: true
> | offset: 2 CreateTime: 1564696640073 keysize: 4 valuesize: -1 sequence: -1 
> headerKeys: []
> | offset: 3 CreateTime: 1564696640073 keysize: 4 valuesize: -1 sequence: -1 
> headerKeys: []
> {noformat}
> It should print out the message key.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-8813) Race condition when creating topics and changing their configuration

2019-08-23 Thread James Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Cheng updated KAFKA-8813:
---
Description: 
In Partition.createLog we do:
{code:java}
val config = LogConfig.fromProps(logManager.currentDefaultConfig.originals, 
props)
val log = logManager.getOrCreateLog(topicPartition, config, isNew, 
isFutureReplica)
{code}
https://github.com/apache/kafka/blob/33d06082117d971cdcddd4f01392006b543f3c01/core/src/main/scala/kafka/cluster/Partition.scala#L315-L316

Config changes that arrive after configs are loaded from ZK, but before 
LogManager has added the partition to `futureLogs` or `currentLogs` (where the 
dynamic config handlers pick up topics to update their configs), will be lost.

  was:
In Partition.createLog we do:

{{val config = LogConfig.fromProps(logManager.currentDefaultConfig.originals, 
props)
val log = logManager.getOrCreateLog(topicPartition, config, isNew, 
isFutureReplica)}}

Config changes that arrive after configs are loaded from ZK, but before 
LogManager has added the partition to `futureLogs` or `currentLogs` (where the 
dynamic config handlers pick up topics to update their configs), will be lost.


> Race condition when creating topics and changing their configuration
> 
>
> Key: KAFKA-8813
> URL: https://issues.apache.org/jira/browse/KAFKA-8813
> Project: Kafka
>  Issue Type: Bug
>Reporter: Gwen Shapira
>Priority: Major
>
> In Partition.createLog we do:
> {code:java}
> val config = LogConfig.fromProps(logManager.currentDefaultConfig.originals, 
> props)
> val log = logManager.getOrCreateLog(topicPartition, config, isNew, 
> isFutureReplica)
> {code}
> https://github.com/apache/kafka/blob/33d06082117d971cdcddd4f01392006b543f3c01/core/src/main/scala/kafka/cluster/Partition.scala#L315-L316
> Config changes that arrive after configs are loaded from ZK, but before 
> LogManager has added the partition to `futureLogs` or `currentLogs` (where the 
> dynamic config handlers pick up topics to update their configs), will be lost.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (KAFKA-8813) Race condition when creating topics and changing their configuration

2019-08-23 Thread James Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Cheng updated KAFKA-8813:
---
Description: 
In Partition.createLog we do:

{{val config = LogConfig.fromProps(logManager.currentDefaultConfig.originals, 
props)
val log = logManager.getOrCreateLog(topicPartition, config, isNew, 
isFutureReplica)}}

Config changes that arrive after configs are loaded from ZK, but before 
LogManager has added the partition to `futureLogs` or `currentLogs` (where the 
dynamic config handlers pick up topics to update their configs), will be lost.

  was:
In Partition.createLog we do:

{{val config = LogConfig.fromProps(logManager.currentDefaultConfig.originals, 
props)val log = logManager.getOrCreateLog(topicPartition, config, isNew, 
isFutureReplica)}}

Config changes that arrive after configs are loaded from ZK, but before 
LogManager has added the partition to `futureLogs` or `currentLogs` (where the 
dynamic config handlers pick up topics to update their configs), will be lost.


> Race condition when creating topics and changing their configuration
> 
>
> Key: KAFKA-8813
> URL: https://issues.apache.org/jira/browse/KAFKA-8813
> Project: Kafka
>  Issue Type: Bug
>Reporter: Gwen Shapira
>Priority: Major
>
> In Partition.createLog we do:
> {{val config = LogConfig.fromProps(logManager.currentDefaultConfig.originals, 
> props)
> val log = logManager.getOrCreateLog(topicPartition, config, isNew, 
> isFutureReplica)}}
> Config changes that arrive after configs are loaded from ZK, but before 
> LogManager has added the partition to `futureLogs` or `currentLogs` (where the 
> dynamic config handlers pick up topics to update their configs), will be lost.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (KAFKA-8808) Kafka Inconsistent Retention Period across partitions

2019-08-17 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909626#comment-16909626
 ] 

James Cheng commented on KAFKA-8808:


How did you create your topics? We ran into something like that recently. We 
were creating the topic with no config, and then setting the configuration 
immediately afterwards. In some cases (network congestion?) the broker would 
receive the configuration *before* the topic creation request. When that 
happened, the broker would ignore the config (because it was for a non-existent 
topic) and then would receive the topic creation request (which had no config, 
and therefore would use the broker defaults).
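
One way to avoid that race is to pass the configs in the CreateTopics request 
itself, so the broker never sees the topic without its config. A minimal sketch 
with the Java AdminClient; the bootstrap server, topic name, counts, and config 
values below are only placeholders for illustration:

{code:java}
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicWithConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Supply the topic config in the create request itself, so there is
            // no window where the topic exists with only the broker defaults.
            NewTopic topic = new NewTopic("TOPICNAME", 4, (short) 2)
                    .configs(Map.of(
                            "retention.ms", "259200000")); // 3 days, illustrative
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
{code}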

> Kafka Inconsistent Retention Period across partitions
> -
>
> Key: KAFKA-8808
> URL: https://issues.apache.org/jira/browse/KAFKA-8808
> Project: Kafka
>  Issue Type: Bug
>  Components: log, log cleaner
>Affects Versions: 1.0.0
>Reporter: Prashant
>Priority: Major
>
> Our topic is created with a retention period of 3 days. The topic has four 
> partitions. The broker-level default is 12 hours. 
> Some partitions' segments get deleted even before 3 days. Logs show that the 
> segment is marked for deletion because it exceeded *4320ms* = 12 hours. 
>  
> "INFO Found deletable segments with base offsets [43275] due to retention 
> time *4320ms* breach (kafka.log.Log)"
>  
> This does not happen for all partitions. After a full cluster bounce, the 
> partitions facing this issue change. 
>  
> *Topic config :* 
> Topic:TOPICNAME PartitionCount:4 ReplicationFactor:2 
> Configs:retention.ms=25920,segment.ms=4320
>  Topic: TOPICNAME Partition: 0 Leader: 1 Replicas: 1,8 Isr: 1,8
>  Topic: TOPICNAME Partition: 1 Leader: 2 Replicas: 2,9 Isr: 9,2
>  Topic: TOPICNAME Partition: 2 Leader: 3 Replicas: 3,1 Isr: 1,3
>  Topic: TOPICNAME Partition: 3 Leader: 4 Replicas: 4,2 Isr: 4,2
>  
> *Broker config :*
> log.retention.ms=4320
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (KAFKA-8745) DumpLogSegments doesn't show keys, when the message is null

2019-08-01 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898440#comment-16898440
 ] 

James Cheng commented on KAFKA-8745:


[~guozhang]: This is ready for review. The instructions at 
[https://cwiki.apache.org/confluence/display/KAFKA/Contributing+Code+Changes] 
say to switch the Jira status to "PATCH AVAILABLE" but I can't figure out how 
to do that.

> DumpLogSegments doesn't show keys, when the message is null
> ---
>
> Key: KAFKA-8745
> URL: https://issues.apache.org/jira/browse/KAFKA-8745
> Project: Kafka
>  Issue Type: Improvement
>  Components: tools
>Affects Versions: 2.3.0
>Reporter: James Cheng
>Assignee: James Cheng
>Priority: Major
>
> When DumpLogSegments encounters a message with a message key, but no message 
> value, it doesn't print out the message key.
>  
> {noformat}
> $ ~/kafka_2.11-2.2.0/bin/kafka-run-class.sh kafka.tools.DumpLogSegments 
> --files compacted-0/.log --print-data-log
> Dumping compacted-0/.log
> Starting offset: 0
> baseOffset: 0 lastOffset: 3 count: 4 baseSequence: -1 lastSequence: -1 
> producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: 
> false isControl: false position: 0 CreateTime: 1564696640073 size: 113 magic: 
> 2 compresscodec: NONE crc: 206507478 isvalid: true
> | offset: 2 CreateTime: 1564696640073 keysize: 4 valuesize: -1 sequence: -1 
> headerKeys: []
> | offset: 3 CreateTime: 1564696640073 keysize: 4 valuesize: -1 sequence: -1 
> headerKeys: []
> {noformat}
> It should print out the message key.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (KAFKA-8745) DumpLogSegments doesn't show keys, when the message is null

2019-08-01 Thread James Cheng (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Cheng updated KAFKA-8745:
---
Flags: Patch

> DumpLogSegments doesn't show keys, when the message is null
> ---
>
> Key: KAFKA-8745
> URL: https://issues.apache.org/jira/browse/KAFKA-8745
> Project: Kafka
>  Issue Type: Improvement
>  Components: tools
>Affects Versions: 2.3.0
>Reporter: James Cheng
>Assignee: James Cheng
>Priority: Major
>
> When DumpLogSegments encounters a message with a message key, but no message 
> value, it doesn't print out the message key.
>  
> {noformat}
> $ ~/kafka_2.11-2.2.0/bin/kafka-run-class.sh kafka.tools.DumpLogSegments 
> --files compacted-0/.log --print-data-log
> Dumping compacted-0/.log
> Starting offset: 0
> baseOffset: 0 lastOffset: 3 count: 4 baseSequence: -1 lastSequence: -1 
> producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: 
> false isControl: false position: 0 CreateTime: 1564696640073 size: 113 magic: 
> 2 compresscodec: NONE crc: 206507478 isvalid: true
> | offset: 2 CreateTime: 1564696640073 keysize: 4 valuesize: -1 sequence: -1 
> headerKeys: []
> | offset: 3 CreateTime: 1564696640073 keysize: 4 valuesize: -1 sequence: -1 
> headerKeys: []
> {noformat}
> It should print out the message key.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (KAFKA-8745) DumpLogSegments doesn't show keys, when the message is null

2019-08-01 Thread James Cheng (JIRA)
James Cheng created KAFKA-8745:
--

 Summary: DumpLogSegments doesn't show keys, when the message is 
null
 Key: KAFKA-8745
 URL: https://issues.apache.org/jira/browse/KAFKA-8745
 Project: Kafka
  Issue Type: Improvement
  Components: tools
Affects Versions: 2.3.0
Reporter: James Cheng


When DumpLogSegments encounters a message with a message key, but no message 
value, it doesn't print out the message key.

 
{noformat}
$ ~/kafka_2.11-2.2.0/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
compacted-0/.log --print-data-log
Dumping compacted-0/.log
Starting offset: 0
baseOffset: 0 lastOffset: 3 count: 4 baseSequence: -1 lastSequence: -1 
producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false 
isControl: false position: 0 CreateTime: 1564696640073 size: 113 magic: 2 
compresscodec: NONE crc: 206507478 isvalid: true
| offset: 2 CreateTime: 1564696640073 keysize: 4 valuesize: -1 sequence: -1 
headerKeys: []
| offset: 3 CreateTime: 1564696640073 keysize: 4 valuesize: -1 sequence: -1 
headerKeys: []
{noformat}
It should print out the message key.

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (KAFKA-8745) DumpLogSegments doesn't show keys, when the message is null

2019-08-01 Thread James Cheng (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Cheng reassigned KAFKA-8745:
--

Assignee: James Cheng

> DumpLogSegments doesn't show keys, when the message is null
> ---
>
> Key: KAFKA-8745
> URL: https://issues.apache.org/jira/browse/KAFKA-8745
> Project: Kafka
>  Issue Type: Improvement
>  Components: tools
>Affects Versions: 2.3.0
>Reporter: James Cheng
>Assignee: James Cheng
>Priority: Major
>
> When DumpLogSegments encounters a message with a message key, but no message 
> value, it doesn't print out the message key.
>  
> {noformat}
> $ ~/kafka_2.11-2.2.0/bin/kafka-run-class.sh kafka.tools.DumpLogSegments 
> --files compacted-0/.log --print-data-log
> Dumping compacted-0/.log
> Starting offset: 0
> baseOffset: 0 lastOffset: 3 count: 4 baseSequence: -1 lastSequence: -1 
> producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: 
> false isControl: false position: 0 CreateTime: 1564696640073 size: 113 magic: 
> 2 compresscodec: NONE crc: 206507478 isvalid: true
> | offset: 2 CreateTime: 1564696640073 keysize: 4 valuesize: -1 sequence: -1 
> headerKeys: []
> | offset: 3 CreateTime: 1564696640073 keysize: 4 valuesize: -1 sequence: -1 
> headerKeys: []
> {noformat}
> It should print out the message key.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (KAFKA-8634) Update ZooKeeper to 3.5.5

2019-07-11 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883294#comment-16883294
 ] 

James Cheng commented on KAFKA-8634:


[~ijuma]: Prior to this PR, can I use Kafka to talk to Zookeeper instances 
running 3.5.5? (Meaning, I think, Kafka using Zookeeper 3.4.14 clients, talking 
to Zookeeper 3.5.5 server)

 

Or, is this PR required to even talk to a Zookeeper 3.5.5 server?

 

> Update ZooKeeper to 3.5.5
> -
>
> Key: KAFKA-8634
> URL: https://issues.apache.org/jira/browse/KAFKA-8634
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Ismael Juma
>Assignee: Ismael Juma
>Priority: Major
> Fix For: 2.4.0
>
>
> ZooKeeper 3.5.5 is the first stable release in the 3.5.x series. The key new 
> feature in ZK 3.5.x is TLS support.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (KAFKA-8282) Missing JMX bandwidth quota metrics for Produce and Fetch

2019-04-25 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826524#comment-16826524
 ] 

James Cheng commented on KAFKA-8282:


In 1.0.0, the behavior was changed such that these metrics are only collected 
if client quotas are enabled.

The change was made in https://issues.apache.org/jira/browse/KAFKA-5402

[~rsivaram] mentioned that if you want these metrics, but don't want to enforce 
quotas, you can set your quota to something really high (she recommends 
Long.MAX_VALUE - 1). 
https://issues.apache.org/jira/browse/KAFKA-5402?focusedCommentId=16044100=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16044100

Note, it can't be Long.MAX_VALUE, because the code actually treats that as 
"disable quotas". See 
[https://github.com/apache/kafka/pull/3303/files#diff-ccd0deee5adb38987e4f009b749fd11cR141]
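
For reference, a rough sketch of applying that workaround programmatically with 
the Admin quota API that was added later in KIP-546 (2.6+ clients and brokers); 
on older versions the same default client quotas can be set with 
kafka-configs.sh instead. The entity and quota value below are illustrative only:

{code:java}
import java.util.Arrays;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;

public class EnableQuotaMetrics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Effectively unlimited, but deliberately NOT Long.MAX_VALUE, which
            // the broker treats as "quotas disabled" (see the diff linked above).
            double nearlyUnlimited = (double) (Long.MAX_VALUE - 1);
            // A null client-id name means the default quota for all clients.
            ClientQuotaEntity allClients = new ClientQuotaEntity(
                    Collections.singletonMap(ClientQuotaEntity.CLIENT_ID, null));
            ClientQuotaAlteration alteration = new ClientQuotaAlteration(
                    allClients,
                    Arrays.asList(
                            new ClientQuotaAlteration.Op("producer_byte_rate", nearlyUnlimited),
                            new ClientQuotaAlteration.Op("consumer_byte_rate", nearlyUnlimited)));
            admin.alterClientQuotas(Collections.singleton(alteration)).all().get();
        }
    }
}
{code}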

 

 

 

> Missing JMX bandwidth quota metrics for Produce and Fetch
> -
>
> Key: KAFKA-8282
> URL: https://issues.apache.org/jira/browse/KAFKA-8282
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.1.1
>Reporter: JMVM
>Priority: Major
> Attachments: Screen Shot 2019-04-23 at 20.59.21.png
>
>
> Recently I performed several *rolling upgrades following official steps* for 
> our Kafka brokers *from 0.11.0.1 to newer versions in different 
> environments*, and they apparently worked fine in all cases from a functional 
> point of view: *producers and consumers working as expected*. 
> Specifically, I upgraded:
>  # From *0.11.0.1 to 1.0.0*, and then from *1.0.0 to 2.0.0*, and then *to* 
> *2.1.1*
>  # *From 0.11.0.1 directly to 2.1.1*
> However, in all cases the *JMX bandwidth quota metrics for Fetch and Produce*, 
> which used to show *all producers and consumers working with the brokers*, are 
> gone; only queue-size is shown in our *JMX monitoring client (specifically 
> Introscope Wily)*, *with the same configuration as before* (see attached image).
> In fact, I removed the Wily filter configuration for JMX in *order to show all 
> possible metrics*, and both Fetch and Produce are still gone.
> Note that I checked that the version was correct after the rolling upgrade, 
> for example for *2.1.1*, and it was as expected:
> *ll /opt/kafka/libs/*
> *total 54032*
> *-rw-r--r-- 1 kafka kafka    69409 Jan  4 08:42 activation-1.1.1.jar*
> *-rw-r--r-- 1 kafka kafka    14768 Jan  4 08:42 
> aopalliance-repackaged-2.5.0-b42.jar*
> *-rw-r--r-- 1 kafka kafka    90347 Jan  4 08:42 argparse4j-0.7.0.jar*
> *-rw-r--r-- 1 kafka kafka    20437 Jan  4 08:40 
> audience-annotations-0.5.0.jar*
> *-rw-r--r-- 1 kafka kafka   501879 Jan  4 08:43 commons-lang3-3.8.1.jar*
> *-rw-r--r-- 1 kafka kafka    96801 Feb  8 18:32 connect-api-2.1.1.jar*
> *-rw-r--r-- 1 kafka kafka    18265 Feb  8 18:32 
> connect-basic-auth-extension-2.1.1.jar*
> *-rw-r--r-- 1 kafka kafka    20509 Feb  8 18:32 connect-file-2.1.1.jar*
> *-rw-r--r-- 1 kafka kafka    45489 Feb  8 18:32 connect-json-2.1.1.jar*
> *-rw-r--r-- 1 kafka kafka   466588 Feb  8 18:32 connect-runtime-2.1.1.jar*
> *-rw-r--r-- 1 kafka kafka    90358 Feb  8 18:32 connect-transforms-2.1.1.jar*
> *-rw-r--r-- 1 kafka kafka  2442625 Jan  4 08:43 guava-20.0.jar*
> *-rw-r--r-- 1 kafka kafka   186763 Jan  4 08:42 hk2-api-2.5.0-b42.jar*
> *-rw-r--r-- 1 kafka kafka   189454 Jan  4 08:42 hk2-locator-2.5.0-b42.jar*
> *-rw-r--r-- 1 kafka kafka   135317 Jan  4 08:42 hk2-utils-2.5.0-b42.jar*
> *-rw-r--r-- 1 kafka kafka    66894 Jan 11 21:28 jackson-annotations-2.9.8.jar*
> *-rw-r--r-- 1 kafka kafka   325619 Jan 11 21:27 jackson-core-2.9.8.jar*
> *-rw-r--r-- 1 kafka kafka  1347236 Jan 11 21:27 jackson-databind-2.9.8.jar*
> *-rw-r--r-- 1 kafka kafka    32373 Jan 11 21:28 jackson-jaxrs-base-2.9.8.jar*
> *-rw-r--r-- 1 kafka kafka    15861 Jan 11 21:28 
> jackson-jaxrs-json-provider-2.9.8.jar*
> *-rw-r--r-- 1 kafka kafka    32627 Jan 11 21:28 
> jackson-module-jaxb-annotations-2.9.8.jar*
> *-rw-r--r-- 1 kafka kafka   737884 Jan  4 08:43 javassist-3.22.0-CR2.jar*
> *-rw-r--r-- 1 kafka kafka    26366 Jan  4 08:42 javax.annotation-api-1.2.jar*
> *-rw-r--r-- 1 kafka kafka 2497 Jan  4 08:42 javax.inject-1.jar*
> *-rw-r--r-- 1 kafka kafka 5951 Jan  4 08:42 javax.inject-2.5.0-b42.jar*
> *-rw-r--r-- 1 kafka kafka    95806 Jan  4 08:42 javax.servlet-api-3.1.0.jar*
> *-rw-r--r-- 1 kafka kafka   126898 Jan  4 08:42 javax.ws.rs-api-2.1.1.jar*
> *-rw-r--r-- 1 kafka kafka   127509 Jan  4 08:42 javax.ws.rs-api-2.1.jar*
> *-rw-r--r-- 1 kafka kafka   125632 Jan  4 08:42 jaxb-api-2.3.0.jar*
> *-rw-r--r-- 1 kafka kafka   181563 Jan  4 08:42 jersey-client-2.27.jar*
> *-rw-r--r-- 1 kafka kafka  1140395 Jan  4 08:43 jersey-common-2.27.jar*
> *-rw-r--r-- 1 kafka kafka    18085 Jan  4 08:42 
> jersey-container-servlet-2.27.jar*
> *-rw-r--r-- 1 kafka kafka    59332 Jan  4 08:42 
> 

[jira] [Commented] (KAFKA-8180) Deleting large number of topics can block the controller for the time it takes to delete all of them

2019-04-04 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809585#comment-16809585
 ] 

James Cheng commented on KAFKA-8180:


Looking forward to seeing this get fixed. This has locked up our clusters 
multiple times, and it required ZooKeeper surgery in order to recover (delete 
the /admin/delete_topics/* entries, and bounce the controller). 
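
For anyone who has to do the same surgery, here is a rough sketch of the 
"delete the /admin/delete_topics/* entries" step using the plain ZooKeeper Java 
client. The connection string is a placeholder, and this is very much a 
last-resort action on a stuck cluster:

{code:java}
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class AbortStuckTopicDeletions {
    public static void main(String[] args) throws Exception {
        // Connect directly to the cluster's ZooKeeper ensemble.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });
        try {
            // Remove every pending topic-deletion marker under /admin/delete_topics.
            List<String> pending = zk.getChildren("/admin/delete_topics", false);
            for (String topic : pending) {
                zk.delete("/admin/delete_topics/" + topic, -1);
            }
            // Afterwards, bounce the controller broker (as described above) so it
            // reloads the cleaned-up deletion state.
        } finally {
            zk.close();
        }
    }
}
{code}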

> Deleting large number of topics can block the controller for the time it 
> takes to delete all of them
> 
>
> Key: KAFKA-8180
> URL: https://issues.apache.org/jira/browse/KAFKA-8180
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Reporter: Gwen Shapira
>Priority: Major
>
> Scenario:
> - Create large number of topics (In my experiment: 400 topics with 12 
> partitions each )
> - Use the admin client to delete all of them in a single batch operation
> - Try to bounce another broker while this is going on
> As you can see from the logs and metrics - topic deletion happens 
> synchronously in the controller and it does not do anything else (leader 
> elections for instance) while it is busy deleting (which can take many 
> minutes for large batches).
> I recommend fixing it by throttling the deletes - no matter how large a batch 
> the client sent, the controller should delete a subset and complete a full 
> cycle before deleting the next subset.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8069) Committed offsets get cleaned up right after the coordinator loading them back from __consumer_offsets in broker with old inter-broker protocol version (< 2.2)

2019-03-08 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788101#comment-16788101
 ] 

James Cheng commented on KAFKA-8069:


[~mjsax] [~hachikuji], is this critical enough that we will spin a new RC for 
2.2.0?

> Committed offsets get cleaned up right after the coordinator loading them 
> back from __consumer_offsets in broker with old inter-broker protocol version 
> (< 2.2)
> ---
>
> Key: KAFKA-8069
> URL: https://issues.apache.org/jira/browse/KAFKA-8069
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 2.1.0, 2.2.0, 2.1.1, 2.1.2, 2.2.1
>Reporter: Zhanxiang (Patrick) Huang
>Assignee: Zhanxiang (Patrick) Huang
>Priority: Critical
> Fix For: 2.2.0, 2.0.2, 2.1.2
>
>
> After the 2.1 release, if the broker hasn't been upgraded to the latest 
> inter-broker protocol version, 
> the committed offsets stored in the __consumer_offsets topic will get cleaned 
> up much earlier than they should be when the offsets are loaded back from the 
> __consumer_offsets topic in GroupCoordinator, which will happen during a 
> leadership transition or after a broker bounce.
> TL;DR
> For V1 on-disk format for __consumer_offsets, we have the *expireTimestamp* 
> field and if the inter-broker protocol (IBP) version is prior to 2.1 (prior 
> to 
> [KIP-211|https://cwiki.apache.org/confluence/display/KAFKA/KIP-211%3A+Revise+Expiration+Semantics+of+Consumer+Group+Offsets])
>  for a kafka 2.1 broker, the logic of getting the expired offsets looks like:
> {code:java}
> def getExpiredOffsets(baseTimestamp: CommitRecordMetadataAndOffset => Long): 
> Map[TopicPartition, OffsetAndMetadata] = {
>  offsets.filter {
>  case (topicPartition, commitRecordMetadataAndOffset) =>
>  ... && {
>  commitRecordMetadataAndOffset.offsetAndMetadata.expireTimestamp match {
>  case None =>
>  // current version with no per partition retention
>  currentTimestamp - baseTimestamp(commitRecordMetadataAndOffset) >= 
> offsetRetentionMs
>  case Some(expireTimestamp) =>
>  // older versions with explicit expire_timestamp field => old expiration 
> semantics is used
>  currentTimestamp >= expireTimestamp
>  }
>  }
>  }
>  }
> {code}
> The expireTimestamp in the on-disk offset record can only be set when storing 
> the committed offset in the __consumer_offsets topic. But the GroupCoordinator 
> also has to keep an in-memory representation of the expireTimestamp (see the 
> code above), which can be set in the following two cases:
>  # Upon the GroupCoordinator receiving OffsetCommitRequest, the 
> expireTimestamp is set using the following logic:
> {code:java}
> expireTimestamp = offsetCommitRequest.retentionTime match {
>  case OffsetCommitRequest.DEFAULT_RETENTION_TIME => None
>  case retentionTime => Some(currentTimestamp + retentionTime)
> }
> {code}
> In all the latest client versions, the consumer will set out 
> OffsetCommitRequest with DEFAULT_RETENTION_TIME so the expireTimestamp will 
> always be None in this case. *This means any committed offset set in this 
> case will always hit the "case None" in the "getExpiredOffsets(...)" when 
> coordinator is doing the cleanup, which is correct.*
>  # Upon the GroupCoordinator loading the committed offsets stored in the 
> __consumer_offsets topic from disk, the expireTimestamp is set using the 
> following logic if IBP<2.1:
> {code:java}
> val expireTimestamp = 
> value.get(OFFSET_VALUE_EXPIRE_TIMESTAMP_FIELD_V1).asInstanceOf[Long]
> {code}
> and the logic to persist the expireTimestamp is:
> {code:java}
> // OffsetCommitRequest.DEFAULT_TIMESTAMP = -1
> value.set(OFFSET_VALUE_EXPIRE_TIMESTAMP_FIELD_V1, 
> offsetAndMetadata.expireTimestamp.getOrElse(OffsetCommitRequest.DEFAULT_TIMESTAMP))
> {code}
> Since the in-memory expireTimestamp will always be None in our case as 
> mentioned in 1), we will always store -1 on-disk. Therefore, when the offset 
> is loaded from the __consumer_offsets topic, the in-memory expireTimestamp 
> will always be set to -1. *This means any committed offset set in this case 
> will always hit "case Some(expireTimestamp)" in the "getExpiredOffsets(...)" 
> when coordinator is doing the cleanup, which basically indicates we will 
> always expire the committed offset on the first expiration check (which is 
> shortly after they are loaded from __consumer_offsets topic)*.
> I am able to reproduce this bug on my local box with one broker using 2.*,1.* 
> and 0.11.* consumer. The consumer will see null committed offset after the 
> broker is bounced.
> This bug is introduced by [PR-5690|https://github.com/apache/kafka/pull/5690] 
> in the kafka 2.1 release and the fix is very straight-forward, which is 
> basically set 

[jira] [Created] (KAFKA-7884) Docs for message.format.version and log.message.format.version show invalid (corrupt?) "valid values"

2019-01-29 Thread James Cheng (JIRA)
James Cheng created KAFKA-7884:
--

 Summary: Docs for message.format.version and 
log.message.format.version show invalid (corrupt?) "valid values"
 Key: KAFKA-7884
 URL: https://issues.apache.org/jira/browse/KAFKA-7884
 Project: Kafka
  Issue Type: Bug
  Components: documentation
Reporter: James Cheng


In the docs for message.format.version and log.message.format.version, the list 
of valid values is

 
{code:java}
kafka.api.ApiVersionValidator$@56aac163 
{code}
 

It appears it's simply doing a .toString on the class/instance.

At a minimum, we should remove this java-y-ness.

Even better, it should show all the valid values.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KAFKA-7657) Invalid reporting of stream state in Kafka streams application

2019-01-23 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16750278#comment-16750278
 ] 

James Cheng edited comment on KAFKA-7657 at 1/23/19 5:59 PM:
-

This was fixed in [https://github.com/apache/kafka/pull/6091]

But the Jira didn't get updated/linked to it, because the title for that PR was 
"K7657" instead of "KAFKA-7657" /cc [~guozhang]

 


was (Author: wushujames):
This was fixed in [https://github.com/apache/kafka/pull/6091]

But the Jira didn't get updated/linked to it, because the title for that Jira 
is "K7657" instead of "KAFKA-7657"

 

> Invalid reporting of stream state in Kafka streams application
> --
>
> Key: KAFKA-7657
> URL: https://issues.apache.org/jira/browse/KAFKA-7657
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.0.1
>Reporter: Thomas Crowley
>Assignee: Guozhang Wang
>Priority: Major
>  Labels: bug
> Fix For: 2.2.0
>
>
> We have a streams application with 3 instances running, two of which are 
> reporting the state of REBALANCING even after they have been running for 
> days. Restarting the application has no effect on the stream state.
> This seems suspect because each instance appears to be processing messages, 
> and the kafka-consumer-groups CLI tool reports hardly any offset lag in any 
> of the partitions assigned to the REBALANCING consumers. Each partition seems 
> to be processing an equal amount of records too.
> Inspecting the state.dir on disk, it looks like the RocksDB state has been 
> built and hovers at the expected size on disk.
> This problem has persisted for us after we rebuilt our Kafka cluster and 
> reset topics + consumer groups in our dev environment.
> There is nothing in the logs (with level set to DEBUG) in both the broker or 
> the application that suggests something exceptional has happened causing the 
> application to be stuck REBALANCING.
> We are also running multiple streaming applications where this problem does 
> not exist.
> Two differences between this application and our other streaming applications 
> are:
>  * We have processing.guarantee set to exactly_once
>  * We are using a ValueTransformer which fetches from and puts data on a 
> windowed state store
> The REBALANCING state is returned from both polling the state method of our 
> KafkaStreams instance, and our custom metric which is derived from some logic 
> in a KafkaStreams.StateListener class attached via the setStateListener 
> method.
>  
> While I have provided a bit of context, before I reply with some reproducible 
> code - is there a simple way in which I can determine that my streams 
> application is in a RUNNING state without relying on the same mechanisms as 
> used above?
> Further, given that it seems like my application is actually running - could 
> this perhaps be a bug to do with how the stream state is being reported (in 
> the context of a transactional stream using the processor API)?
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7657) Invalid reporting of stream state in Kafka streams application

2019-01-23 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16750278#comment-16750278
 ] 

James Cheng commented on KAFKA-7657:


This was fixed in [https://github.com/apache/kafka/pull/6091]

But the Jira didn't get updated/linked to it, because the title for that Jira 
is "K7657" instead of "KAFKA-7657"

 

> Invalid reporting of stream state in Kafka streams application
> --
>
> Key: KAFKA-7657
> URL: https://issues.apache.org/jira/browse/KAFKA-7657
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.0.1
>Reporter: Thomas Crowley
>Assignee: Guozhang Wang
>Priority: Major
>  Labels: bug
> Fix For: 2.2.0
>
>
> We have a streams application with 3 instances running, two of which are 
> reporting the state of REBALANCING even after they have been running for 
> days. Restarting the application has no effect on the stream state.
> This seems suspect because each instance appears to be processing messages, 
> and the kafka-consumer-groups CLI tool reports hardly any offset lag in any 
> of the partitions assigned to the REBALANCING consumers. Each partition seems 
> to be processing an equal amount of records too.
> Inspecting the state.dir on disk, it looks like the RocksDB state has been 
> built and hovers at the expected size on disk.
> This problem has persisted for us after we rebuilt our Kafka cluster and 
> reset topics + consumer groups in our dev environment.
> There is nothing in the logs (with level set to DEBUG) in both the broker or 
> the application that suggests something exceptional has happened causing the 
> application to be stuck REBALANCING.
> We are also running multiple streaming applications where this problem does 
> not exist.
> Two differences between this application and our other streaming applications 
> are:
>  * We have processing.guarantee set to exactly_once
>  * We are using a ValueTransformer which fetches from and puts data on a 
> windowed state store
> The REBALANCING state is returned from both polling the state method of our 
> KafkaStreams instance, and our custom metric which is derived from some logic 
> in a KafkaStreams.StateListener class attached via the setStateListener 
> method.
>  
> While I have provided a bit of context, before I reply with some reproducible 
> code - is there a simple way in which I can determine that my streams 
> application is in a RUNNING state without relying on the same mechanisms as 
> used above?
> Further, given that it seems like my application is actually running - could 
> this perhaps be a bug to do with how the stream state is being reported (in 
> the context of a transactional stream using the processor API)?
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6964) Add ability to print all internal topic names

2019-01-20 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16747400#comment-16747400
 ] 

James Cheng commented on KAFKA-6964:


[~guozhang], is the reason you are closing it as WONTFIX because of 
[~bbejeck]'s comment on 26/Jun/18? Meaning: the primary reason to know the 
internal topic names was to know what to add to an ACL; since ACLs now support 
prefix matching and the application writer knows the prefix, there is no need 
to expose the full topic names?
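
(For anyone reading this for the ACL angle, a minimal sketch of granting a 
prefixed ACL with the Java AdminClient, available since prefixed patterns were 
added; the principal, host, and prefix below are placeholders.)

{code:java}
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class GrantPrefixedAcl {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Allow the streams app's principal to use every topic whose name starts
            // with its application.id prefix, which covers its internal topics.
            ResourcePattern internalTopics = new ResourcePattern(
                    ResourceType.TOPIC, "my-streams-app-", PatternType.PREFIXED);
            AccessControlEntry allowAll = new AccessControlEntry(
                    "User:my-streams-app", "*", AclOperation.ALL, AclPermissionType.ALLOW);
            admin.createAcls(Collections.singleton(new AclBinding(internalTopics, allowAll)))
                    .all().get();
        }
    }
}
{code}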

> Add ability to print all internal topic names
> -
>
> Key: KAFKA-6964
> URL: https://issues.apache.org/jira/browse/KAFKA-6964
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Reporter: Bill Bejeck
>Assignee: Jagadesh Adireddi
>Priority: Major
>  Labels: needs-kip
>
> For security access reasons some streams users need to build all internal 
> topics before deploying their streams application.  While it's possible to 
> get all internal topic names from the {{Topology#describe()}} method, it 
> would be nice to have a separate method that prints out only the internal 
> topic names to ease the process.
> I think this change will require a KIP, so I've added the appropriate label.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-7797) Replication throttling configs aren't in the docs

2019-01-08 Thread James Cheng (JIRA)
James Cheng created KAFKA-7797:
--

 Summary: Replication throttling configs aren't in the docs
 Key: KAFKA-7797
 URL: https://issues.apache.org/jira/browse/KAFKA-7797
 Project: Kafka
  Issue Type: Improvement
  Components: documentation
Affects Versions: 2.1.0
Reporter: James Cheng


Docs for the following configs are not on the website:
 * leader.replication.throttled.rate
 * follower.replication.throttled.rate
 * replica.alter.log.dirs.io.max.bytes.per.second

 

They are available in the Operations section titled "Limiting Bandwidth Usage 
during Data Migration", but they are not in the general config section.

I think these are generally applicable, right? Not just during 
kafka-reassign-partitions.sh? If so, then they should be in the auto-generated 
docs.

Related: I think none of the configs in 
core/src/main/scala/kafka/server/DynamicConfig.scala are in the generated docs.
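
For context on how these get set in practice, a rough sketch of setting one of 
them as a dynamic broker config via the Java AdminClient's 
incrementalAlterConfigs (which needs a 2.3+ client and broker); the broker id 
and rate below are placeholders:

{code:java}
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetReplicationThrottle {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Throttle leader-side replication traffic on broker 0 to ~10 MB/s.
            ConfigResource broker0 = new ConfigResource(ConfigResource.Type.BROKER, "0");
            AlterConfigOp setThrottle = new AlterConfigOp(
                    new ConfigEntry("leader.replication.throttled.rate", "10485760"),
                    AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates =
                    Collections.singletonMap(broker0, Collections.singletonList(setThrottle));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}
{code}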

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6951) Implement offset expiration semantics for unsubscribed topics

2018-11-28 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702167#comment-16702167
 ] 

James Cheng commented on KAFKA-6951:


[~vultron81]: I think that the kafka-consumer-groups.sh tool allows you to 
delete offsets for a specific topic within a consumer group. Something like

$ kafka-consumer-groups.sh --group groupId --topic topicName --delete
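
In newer releases (2.4+, after KIP-496) the same thing can also be done 
programmatically; a rough sketch with the Java Admin client, where the group, 
topic, and partition count are placeholders:

{code:java}
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.TopicPartition;

public class DeleteTopicOffsetsFromGroup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Drop the committed offsets for one topic (assumed here to have two
            // partitions) from the group, leaving other topics' offsets untouched.
            Set<TopicPartition> partitions = new HashSet<>();
            partitions.add(new TopicPartition("topicName", 0));
            partitions.add(new TopicPartition("topicName", 1));
            admin.deleteConsumerGroupOffsets("groupId", partitions).all().get();
        }
    }
}
{code}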

 

 

 

> Implement offset expiration semantics for unsubscribed topics
> -
>
> Key: KAFKA-6951
> URL: https://issues.apache.org/jira/browse/KAFKA-6951
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Reporter: Vahid Hashemian
>Assignee: Vahid Hashemian
>Priority: Major
> Fix For: 2.2.0
>
>
> [This 
> portion|https://cwiki.apache.org/confluence/display/KAFKA/KIP-211%3A+Revise+Expiration+Semantics+of+Consumer+Group+Offsets#KIP-211:ReviseExpirationSemanticsofConsumerGroupOffsets-UnsubscribingfromaTopic]
>  of KIP-211 will be implemented separately from the main PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7615) Support different topic name in source and destination server in Mirrormaker

2018-11-14 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687423#comment-16687423
 ] 

James Cheng commented on KAFKA-7615:


[~adeetikaushal]: This can be implemented by supplying a MessageHandler to 
MirrorMaker. See 
[https://github.com/gwenshap/kafka-examples/tree/master/MirrorMakerHandler] for 
an example.

> Support different topic name in source and destination server in Mirrormaker
> 
>
> Key: KAFKA-7615
> URL: https://issues.apache.org/jira/browse/KAFKA-7615
> Project: Kafka
>  Issue Type: New Feature
>  Components: mirrormaker
>Reporter: Adeeti Kaushal
>Priority: Minor
>
> Currently mirrormaker only supports same topic name in source and destination 
> broker. Support for different topic names in source and destination brokers 
> is needed.
>  
> source broker : topic name -> topicA
> destination broker: topic name -> topicB
>  
> MirrorData from topicA to topicB



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions

2018-10-12 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16648586#comment-16648586
 ] 

James Cheng commented on KAFKA-4084:


[~jiaxinye]: You are probably hitting 
https://issues.apache.org/jira/browse/KAFKA-7299 . We saw the same behavior, 
and our brokers were taking 4 hours to get back into sync. We applied that 
patch, and it dropped to 2.5 minutes.

> automated leader rebalance causes replication downtime for clusters with too 
> many partitions
> 
>
> Key: KAFKA-4084
> URL: https://issues.apache.org/jira/browse/KAFKA-4084
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>Reporter: Tom Crayford
>Priority: Major
>  Labels: reliability
> Fix For: 1.1.0
>
>
> If you enable {{auto.leader.rebalance.enable}} (which is on by default), and 
> you have a cluster with many partitions, there is a severe amount of 
> replication downtime following a restart. This causes 
> `UnderReplicatedPartitions` to fire, and replication is paused.
> This is because the current automated leader rebalance mechanism changes 
> leaders for *all* imbalanced partitions at once, instead of doing it 
> gradually. This effectively stops all replica fetchers in the cluster 
> (assuming there are enough imbalanced partitions), and restarts them. This 
> can take minutes on busy clusters, during which no replication is happening 
> and user data is at risk. Clients with {{acks=-1}} also see issues at this 
> time, because replication is effectively stalled.
> To quote Todd Palino from the mailing list:
> bq. There is an admin CLI command to trigger the preferred replica election 
> manually. There is also a broker configuration “auto.leader.rebalance.enable” 
> which you can set to have the broker automatically perform the PLE when 
> needed. DO NOT USE THIS OPTION. There are serious performance issues when 
> doing so, especially on larger clusters. It needs some development work that 
> has not been fully identified yet.
> This setting is extremely useful for smaller clusters, but with high 
> partition counts causes the huge issues stated above.
> One potential fix could be adding a new configuration for the number of 
> partitions to do automated leader rebalancing for at once, and *stop* once 
> that number of leader rebalances are in flight, until they're done. There may 
> be better mechanisms, and I'd love to hear if anybody has any ideas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KAFKA-7332) Improve error message when trying to produce message without key for compacted topic

2018-09-19 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621106#comment-16621106
 ] 

James Cheng edited comment on KAFKA-7332 at 9/19/18 8:00 PM:
-

Thanks [~lindong]. I agree with you that design-wise it makes sense, and 
practice-wise, I'm not sure.

KAFKA-4808 says that you need to exhaust retries or request.timeout.ms expires. 
I'm not sure how those two interact. Is the request.timeout.ms a per-request 
thing and so you actually have to wait for (retries * request.timeout.ms)? 
According to 
[https://docs.confluent.io/current/streams/developer-guide/config-streams.html,]
 kafka streams has retries = 10. And in kafka 2.0.0, producer 
request.timeout.ms = 30 seconds. If you multiply those together, you get 5 
minutes until you get a timeout. As you said, it's probably small compared to 
working through other workflow stuff, but 5 minutes is quite a while to wait to 
get an error. I would probably end up killing my application repeatedly 
thinking that something else was wrong, and might not ever get to the 5 minutes.

I  don't think I've personally run into this. I'm just thinking through what 
the developer experience would be. [~gwenshap] ran into it recently, but again, 
not sure how much more widely this happens.


was (Author: wushujames):
Thanks [~lindong]. I agree with you that design-wise it makes sense, and 
practice-wise, I'm not sure.

KAFKA-4808 says that you need to exhaust retries or request.timeout.ms expires. 
I'm not sure how those two interact. Is the request.timeout.ms a per-request 
thing and so you actually have to wait for (retries * request.timeout.ms)? 
According to 
[https://docs.confluent.io/current/streams/developer-guide/config-streams.html,]
 kafka streams has retries = 10. And in kafka 2.0.0, producer 
request.timeout.ms = 30 seconds. If you multiple those together, you get 5 
minutes until you get a timeout. As you said, it's probably small compared to 
working through other workflow stuff, but 5 minutes is quite a while to wait to 
get an error. I would probably end up killing my application repeatedly 
thinking that something else was wrong, and might not ever get to the 5 minutes.


I  don't think I've personally run into this. I'm just thinking through what 
the developer experience would be. [~gwenshap] ran into it recently, but again, 
not sure how much more widely this happens.

> Improve error message when trying to produce message without key for 
> compacted topic
> 
>
> Key: KAFKA-7332
> URL: https://issues.apache.org/jira/browse/KAFKA-7332
> Project: Kafka
>  Issue Type: Improvement
>  Components: producer 
>Affects Versions: 1.1.0
>Reporter: Patrik Kleindl
>Assignee: Manikumar
>Priority: Trivial
> Fix For: 2.1.0
>
>
> Goal:
> Return a specific error message like e.g. "Message without a key is not valid 
> for a compacted topic" when trying to produce such a message instead of a 
> CorruptRecordException.
>  
> > Yesterday we had the following exception:
> > 
> > Exception thrown when sending a message with key='null' and payload='...'
> > to topic sometopic:: org.apache.kafka.common.errors.CorruptRecordException:
> > This message has failed its CRC checksum, exceeds the valid size, or is
> > otherwise corrupt.
> > 
> > The cause was identified with the help of
> > 
> >[https://stackoverflow.com/questions/49098274/kafka-stream-get-corruptrecordexception]
> > 
> > Is it possible / would it makes sense to open an issue to improve the error
> > message for this case?
> > A simple "Message without a key is not valid for a compacted topic" would
> > suffice and point a user  in the right direction.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7332) Improve error message when trying to produce message without key for compacted topic

2018-09-19 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621106#comment-16621106
 ] 

James Cheng commented on KAFKA-7332:


Thanks [~lindong]. I agree with you that design-wise it makes sense, and 
practice-wise, I'm not sure.

KAFKA-4808 says that you need to exhaust retries or request.timeout.ms expires. 
I'm not sure how those two interact. Is the request.timeout.ms a per-request 
thing and so you actually have to wait for (retries * request.timeout.ms)? 
According to 
[https://docs.confluent.io/current/streams/developer-guide/config-streams.html,]
 kafka streams has retries = 10. And in kafka 2.0.0, producer 
request.timeout.ms = 30 seconds. If you multiple those together, you get 5 
minutes until you get a timeout. As you said, it's probably small compared to 
working through other workflow stuff, but 5 minutes is quite a while to wait to 
get an error. I would probably end up killing my application repeatedly 
thinking that something else was wrong, and might not ever get to the 5 minutes.


I  don't think I've personally run into this. I'm just thinking through what 
the developer experience would be. [~gwenshap] ran into it recently, but again, 
not sure how much more widely this happens.

> Improve error message when trying to produce message without key for 
> compacted topic
> 
>
> Key: KAFKA-7332
> URL: https://issues.apache.org/jira/browse/KAFKA-7332
> Project: Kafka
>  Issue Type: Improvement
>  Components: producer 
>Affects Versions: 1.1.0
>Reporter: Patrik Kleindl
>Assignee: Manikumar
>Priority: Trivial
> Fix For: 2.1.0
>
>
> Goal:
> Return a specific error message like e.g. "Message without a key is not valid 
> for a compacted topic" when trying to produce such a message instead of a 
> CorruptRecordException.
>  
> > Yesterday we had the following exception:
> > 
> > Exception thrown when sending a message with key='null' and payload='...'
> > to topic sometopic:: org.apache.kafka.common.errors.CorruptRecordException:
> > This message has failed its CRC checksum, exceeds the valid size, or is
> > otherwise corrupt.
> > 
> > The cause was identified with the help of
> > 
> >[https://stackoverflow.com/questions/49098274/kafka-stream-get-corruptrecordexception]
> > 
> > Is it possible / would it makes sense to open an issue to improve the error
> > message for this case?
> > A simple "Message without a key is not valid for a compacted topic" would
> > suffice and point a user  in the right direction.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7332) Improve error message when trying to produce message without key for compacted topic

2018-09-19 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621021#comment-16621021
 ] 

James Cheng commented on KAFKA-7332:


[~lindong], this is somewhat related to 
https://issues.apache.org/jira/browse/KAFKA-4808. While we are making this 
change, should we also address that JIRA? 

> Improve error message when trying to produce message without key for 
> compacted topic
> 
>
> Key: KAFKA-7332
> URL: https://issues.apache.org/jira/browse/KAFKA-7332
> Project: Kafka
>  Issue Type: Improvement
>  Components: producer 
>Affects Versions: 1.1.0
>Reporter: Patrik Kleindl
>Assignee: Manikumar
>Priority: Trivial
> Fix For: 2.1.0
>
>
> Goal:
> Return a specific error message like e.g. "Message without a key is not valid 
> for a compacted topic" when trying to produce such a message instead of a 
> CorruptRecordException.
>  
> > Yesterday we had the following exception:
> > 
> > Exception thrown when sending a message with key='null' and payload='...'
> > to topic sometopic:: org.apache.kafka.common.errors.CorruptRecordException:
> > This message has failed its CRC checksum, exceeds the valid size, or is
> > otherwise corrupt.
> > 
> > The cause was identified with the help of
> > 
> >[https://stackoverflow.com/questions/49098274/kafka-stream-get-corruptrecordexception]
> > 
> > Is it possible / would it makes sense to open an issue to improve the error
> > message for this case?
> > A simple "Message without a key is not valid for a compacted topic" would
> > suffice and point a user  in the right direction.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7286) Loading offsets and group metadata hangs with large group metadata records

2018-09-10 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16608792#comment-16608792
 ] 

James Cheng commented on KAFKA-7286:


How many members would be needed in a group to trigger this scenario?

> Loading offsets and group metadata hangs with large group metadata records
> --
>
> Key: KAFKA-7286
> URL: https://issues.apache.org/jira/browse/KAFKA-7286
> Project: Kafka
>  Issue Type: Bug
>Reporter: Flavien Raynaud
>Assignee: Flavien Raynaud
>Priority: Minor
> Fix For: 2.1.0
>
>
> When a (Kafka-based) consumer group contains many members, group metadata 
> records (in the {{__consumer-offsets}} topic) may happen to be quite large.
> Increasing the {{message.max.bytes}} makes storing these records possible.
>  Loading them when a broker restart is done via 
> [doLoadGroupsAndOffsets|https://github.com/apache/kafka/blob/418a91b5d4e3a0579b91d286f61c2b63c5b4a9b6/core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala#L504].
>  However, this method relies on the {{offsets.load.buffer.size}} 
> configuration to create a 
> [buffer|https://github.com/apache/kafka/blob/418a91b5d4e3a0579b91d286f61c2b63c5b4a9b6/core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala#L513]
>  that will contain the records being loaded.
> If a group metadata record is too large for this buffer, the loading method 
> will get stuck trying to load records (in a tight loop) into a buffer that 
> cannot accommodate a single record.
> 
> For example, if the {{__consumer-offsets-9}} partition contains a record 
> smaller than {{message.max.bytes}} but larger than 
> {{offsets.load.buffer.size}}, logs would indicate the following:
> {noformat}
> ...
> [2018-08-13 21:00:21,073] INFO [GroupMetadataManager brokerId=0] Scheduling 
> loading of offsets and group metadata from __consumer_offsets-9 
> (kafka.coordinator.group.GroupMetadataManager)
> ...
> {noformat}
> But logs will never contain the expected {{Finished loading offsets and group 
> metadata from ...}} line.
> Consumers whose group are assigned to this partition will see {{Marking the 
> coordinator dead}} and will never be able to stabilize and make progress.
> 
> From what I could gather in the code, it seems that:
>  - 
> [fetchDataInfo|https://github.com/apache/kafka/blob/418a91b5d4e3a0579b91d286f61c2b63c5b4a9b6/core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala#L522]
>  returns at least one record (even if larger than 
> {{offsets.load.buffer.size}}, thanks to {{minOneMessage = true}})
>  - No fully-readable record is stored in the buffer with 
> [fileRecords.readInto(buffer, 
> 0)|https://github.com/apache/kafka/blob/418a91b5d4e3a0579b91d286f61c2b63c5b4a9b6/core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala#L528]
>  (too large to fit in the buffer)
>  - 
> [memRecords.batches|https://github.com/apache/kafka/blob/418a91b5d4e3a0579b91d286f61c2b63c5b4a9b6/core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala#L532]
>  returns an empty iterator
>  - 
> [currOffset|https://github.com/apache/kafka/blob/418a91b5d4e3a0579b91d286f61c2b63c5b4a9b6/core/src/main/scala/kafka/coordinator/group/GroupMetadataManager.scala#L590]
>  never advances, hence loading the partition hangs forever.
> 
> It would be great to let the partition load even if a record is larger than 
> the configured {{offsets.load.buffer.size}} limit. The fact that 
> {{minOneMessage = true}} when reading records seems to indicate it might be a 
> good idea for the buffer to accommodate at least one record.
> If you think the limit should stay a hard limit, then at least adding a log 
> line indicating {{offsets.load.buffer.size}} is not large enough and should 
> be increased. Otherwise, one can only guess and dig through the code to 
> figure out what is happening :)
> I will try to open a PR with the first idea (allowing large records to be 
> read when needed) soon, but any feedback from anyone who also had the same 
> issue in the past would be appreciated :)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6746) Allow ZK Znode configurable for Kafka broker

2018-07-12 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542310#comment-16542310
 ] 

James Cheng commented on KAFKA-6746:


I noticed the same thing as [~gsbiju]. I think [~mimaison] may have not looked 
for existing JIRAs before submitting the fix. That's too bad, since Biju had a 
fix ready to merge.

Also, unfortunately, I don't think this Jira was marked as "Patch Available" 
and it was unassigned, so it might not have been noticed that this Jira had a 
fix ready to merge. Sorry Biju, we (the community) didn't help to guide you 
through the process. Here is the documentation we have on the process: 
https://cwiki.apache.org/confluence/display/KAFKA/Contributing+Code+Changes

> Allow ZK Znode configurable for Kafka broker 
> -
>
> Key: KAFKA-6746
> URL: https://issues.apache.org/jira/browse/KAFKA-6746
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Biju Nair
>Priority: Major
>
> By allowing users to specify the {{Znode}} to be used along with the {{ZK 
> Quorum}}, users will be able to reuse a {{ZK}} cluster for many {{Kafka}} 
> clusters. This will help in reducing the {{ZK}} cluster footprint, especially 
> in non-production environments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7144) Kafka Streams doesn't properly balance partition assignment

2018-07-11 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539598#comment-16539598
 ] 

James Cheng commented on KAFKA-7144:


The attached repro Java file has all the code. I didn’t specify any threads, so 
it was using the Kafka Streams default of 1 thread.

Broker version was 1.1.0
Kafka Streams version was 1.1.0

> Kafka Streams doesn't properly balance partition assignment
> ---
>
> Key: KAFKA-7144
> URL: https://issues.apache.org/jira/browse/KAFKA-7144
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 1.1.0
>Reporter: James Cheng
>Priority: Major
> Attachments: OneThenTwelve.java
>
>
> Kafka Streams doesn't always spread the tasks across all available 
> instances/threads.
> I have a topology which consumes a single-partition topic and goes .through() 
> a 12-partition topic. That makes 13 partitions.
>  
> I then started 2 instances of the application. I would have expected the 13 
> partitions to be split across the 2 instances roughly evenly (7 partitions on 
> one, 6 partitions on the other).
> Instead, one instance gets 12 partitions, and the other instance gets 1 
> partition.
>  
> Repro case attached. I ran it a couple times, and it was fairly repeatable.
> Setup for the repro:
> {code:java}
> $ ./bin/kafka-topics.sh --zookeeper localhost --create --topic one 
> --partitions 1 --replication-factor 1 
> $ ./bin/kafka-topics.sh --zookeeper localhost --create --topic twelve 
> --partitions 12 --replication-factor 1
> $ echo foo | kafkacat -P -b 127.0.0.1 -t one
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7141) kafka-consumer-group doesn't describe existing group

2018-07-10 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539373#comment-16539373
 ] 

James Cheng commented on KAFKA-7141:


I think it's a little odd that kafka-consumer-groups doesn't show partition 
assignment at all, when there are no offsets.

 

Currently, if there are 2 partitions (partitions 1 and 2)
 * A) Active consumer, no committed offsets on either of them means that you 
see nothing. No group assignment, no partitions.
 * B) Active consumer, committed offsets on 1, no committed offsets on 2, means 
that you will see rows for both of them, but the CURRENT-OFFSET field for 
partition 2 will have a hyphen in it.
 * C) Active consumer, Committed offsets on both 1 and 2, means you will see 
all the data
 * D) No active consumer, committed offsets on both 1 and 2, means you will see 
the rows, but CONSUMER-ID/HOST/CLIENT-ID will have hyphens.

This Jira is talking about "A".
I would have expected that "A" would display similarly to "B": that is, you would 
see partition assignments, with "-" wherever there are missing offsets.
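
For illustration only (hypothetical topic, offsets, and consumer id, reusing the 
column layout from the output quoted below), case "A" could then render as:
{noformat}
TOPIC       PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID          HOST        CLIENT-ID
some-topic  1          -               0               -    consumer-1-aaaa1111  /10.0.0.1   consumer-1
some-topic  2          -               0               -    consumer-1-aaaa1111  /10.0.0.1   consumer-1
{noformat}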

> kafka-consumer-group doesn't describe existing group
> 
>
> Key: KAFKA-7141
> URL: https://issues.apache.org/jira/browse/KAFKA-7141
> Project: Kafka
>  Issue Type: Bug
>  Components: admin
>Affects Versions: 0.11.0.0, 1.0.1
>Reporter: Bohdana Panchenko
>Priority: Major
>
> I am running two consumers: akka-stream-kafka consumer with standard config 
> section as described in the 
> [https://doc.akka.io/docs/akka-stream-kafka/current/consumer.html] and  
> kafka-console-consumer.
> akka-stream-kafka consumer configuration looks like this
> {noformat}
> akka.kafka.consumer {
>   kafka-clients {
>     group.id = "myakkastreamkafka-1"
>     enable.auto.commit = false
>   }
> }
> {noformat}
>  
>  I am able to see both groups with the command
>  
>  *kafka-consumer-groups --bootstrap-server 127.0.0.1:9092 --list*
>  _Note: This will not show information about old Zookeeper-based consumers._
>  
>  _myakkastreamkafka-1_
>  _console-consumer-57171_
> I am able to view details about the console consumer group:
> *kafka-consumer-groups --describe --bootstrap-server 127.0.0.1:9092 --group 
> console-consumer-57171*
>  _Note: This will not show information about old Zookeeper-based consumers._
> _TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID_
>  _STREAM-TEST 0 0 0 0 consumer-1-6b928e07-196a-4322-9928-068681617878 /172.19.0.4 consumer-1_
> But the command to describe my akka stream consumer gives me empty output:
> *kafka-consumer-groups --describe --bootstrap-server 127.0.0.1:9092 --group 
> myakkastreamkafka-1*
>  _Note: This will not show information about old Zookeeper-based consumers._
> _TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID_
>  
> That is strange. Can you please check the issue?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-7144) Kafka Streams doesn't properly balance partition assignment

2018-07-09 Thread James Cheng (JIRA)
James Cheng created KAFKA-7144:
--

 Summary: Kafka Streams doesn't properly balance partition 
assignment
 Key: KAFKA-7144
 URL: https://issues.apache.org/jira/browse/KAFKA-7144
 Project: Kafka
  Issue Type: Bug
  Components: streams
Affects Versions: 1.1.0
Reporter: James Cheng
 Attachments: OneThenTwelve.java

Kafka Streams doesn't always spread the tasks across all available 
instances/threads

I have a topology which consumes a single partition topic and goes .through() a 
12 partition topic. That makes 13 partitions.

 

I then started 2 instances of the application. I would have expected the 13 
partitions to be split across the 2 instances roughly evenly (7 partitions on 
one, 6 partitions on the other).

Instead, one instance gets 12 partitions, and the other instance gets 1 
partition.

 

Repro case attached. I ran it a couple times, and it was fairly repeatable.

Setup for the repro:
{code:java}
$ ./bin/kafka-topics.sh --zookeeper localhost --create --topic one --partitions 
1 --replication-factor 1 
$ ./bin/kafka-topics.sh --zookeeper localhost --create --topic twelve 
--partitions 12 --replication-factor 1
$ echo foo | kafkacat -P -b 127.0.0.1 -t one
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KAFKA-1342) Slow controlled shutdowns can result in stale shutdown requests

2018-07-07 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535641#comment-16535641
 ] 

James Cheng edited comment on KAFKA-1342 at 7/7/18 6:12 AM:


In an older release, there was a logging “bug” in the controller where upon any 
leadership change in a single partition, it would log the state of *all 
partitions in the cluster*. That was probably the cause of the slowness you 
were seeing.

It got fixed at one point. I don’t remember which release it was fixed in, but 
I’m pretty sure that the “bug” existed in 0.10.2


was (Author: wushujames):
In an older release, there was a logging “bug” in the controller where upon any 
leadership change in a single partition, it would would the state of *all 
partitions in the cluster*. That was probably the cause of the slowness you 
were seeing.

It got fixed at one point. I don’t remember which release it was fixed in, but 
I’m pretty sure that the “bug” existed in 0.10.2

> Slow controlled shutdowns can result in stale shutdown requests
> ---
>
> Key: KAFKA-1342
> URL: https://issues.apache.org/jira/browse/KAFKA-1342
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.1
>Reporter: Joel Koshy
>Assignee: Joel Koshy
>Priority: Critical
>  Labels: newbie++, newbiee, reliability
>
> I don't think this is a bug introduced in 0.8.1., but triggered by the fact
> that controlled shutdown seems to have become slower in 0.8.1 (will file a
> separate ticket to investigate that). When doing a rolling bounce, it is
> possible for a bounced broker to stop all its replica fetchers since the
> previous PID's shutdown requests are still being shutdown.
> - 515 is the controller
> - Controlled shutdown initiated for 503
> - Controller starts controlled shutdown for 503
> - The controlled shutdown takes a long time in moving leaders and moving
>   follower replicas on 503 to the offline state.
> - So 503's read from the shutdown channel times out and a new channel is
>   created. It issues another shutdown request.  This request (since it is a
>   new channel) is accepted at the controller's socket server but then waits
>   on the broker shutdown lock held by the previous controlled shutdown which
>   is still in progress.
> - The above step repeats for the remaining retries (six more requests).
> - 503 hits SocketTimeout exception on reading the response of the last
>   shutdown request and proceeds to do an unclean shutdown.
> - The controller's onBrokerFailure call-back fires and moves 503's replicas
>   to offline (not too important in this sequence).
> - 503 is brought back up.
> - The controller's onBrokerStartup call-back fires and moves its replicas
>   (and partitions) to online state. 503 starts its replica fetchers.
> - Unfortunately, the (phantom) shutdown requests are still being handled and
>   the controller sends StopReplica requests to 503.
> - The first shutdown request finally finishes (after 76 minutes in my case!).
> - The remaining shutdown requests also execute and do the same thing (sends
>   StopReplica requests for all partitions to
>   503).
> - The remaining requests complete quickly because they end up not having to
>   touch zookeeper paths - no leaders left on the broker and no need to
>   shrink ISR in zookeeper since it has already been done by the first
>   shutdown request.
> - So in the end-state 503 is up, but effectively idle due to the previous
>   PID's shutdown requests.
> There are some obvious fixes that can be made to controlled shutdown to help
> address the above issue. E.g., we don't really need to move follower
> partitions to Offline. We did that as an "optimization" so the broker falls
> out of ISR sooner - which is helpful when producers set required.acks to -1.
> However it adds a lot of latency to controlled shutdown. Also, (more
> importantly) we should have a mechanism to abort any stale shutdown process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-1342) Slow controlled shutdowns can result in stale shutdown requests

2018-07-07 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535644#comment-16535644
 ] 

James Cheng commented on KAFKA-1342:


Found it: 

https://github.com/apache/kafka/pull/4075

And

https://issues.apache.org/jira/browse/KAFKA-6116





> Slow controlled shutdowns can result in stale shutdown requests
> ---
>
> Key: KAFKA-1342
> URL: https://issues.apache.org/jira/browse/KAFKA-1342
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.1
>Reporter: Joel Koshy
>Assignee: Joel Koshy
>Priority: Critical
>  Labels: newbie++, newbiee, reliability
>
> I don't think this is a bug introduced in 0.8.1., but triggered by the fact
> that controlled shutdown seems to have become slower in 0.8.1 (will file a
> separate ticket to investigate that). When doing a rolling bounce, it is
> possible for a bounced broker to stop all its replica fetchers since the
> previous PID's shutdown requests are still being shutdown.
> - 515 is the controller
> - Controlled shutdown initiated for 503
> - Controller starts controlled shutdown for 503
> - The controlled shutdown takes a long time in moving leaders and moving
>   follower replicas on 503 to the offline state.
> - So 503's read from the shutdown channel times out and a new channel is
>   created. It issues another shutdown request.  This request (since it is a
>   new channel) is accepted at the controller's socket server but then waits
>   on the broker shutdown lock held by the previous controlled shutdown which
>   is still in progress.
> - The above step repeats for the remaining retries (six more requests).
> - 503 hits SocketTimeout exception on reading the response of the last
>   shutdown request and proceeds to do an unclean shutdown.
> - The controller's onBrokerFailure call-back fires and moves 503's replicas
>   to offline (not too important in this sequence).
> - 503 is brought back up.
> - The controller's onBrokerStartup call-back fires and moves its replicas
>   (and partitions) to online state. 503 starts its replica fetchers.
> - Unfortunately, the (phantom) shutdown requests are still being handled and
>   the controller sends StopReplica requests to 503.
> - The first shutdown request finally finishes (after 76 minutes in my case!).
> - The remaining shutdown requests also execute and do the same thing (sends
>   StopReplica requests for all partitions to
>   503).
> - The remaining requests complete quickly because they end up not having to
>   touch zookeeper paths - no leaders left on the broker and no need to
>   shrink ISR in zookeeper since it has already been done by the first
>   shutdown request.
> - So in the end-state 503 is up, but effectively idle due to the previous
>   PID's shutdown requests.
> There are some obvious fixes that can be made to controlled shutdown to help
> address the above issue. E.g., we don't really need to move follower
> partitions to Offline. We did that as an "optimization" so the broker falls
> out of ISR sooner - which is helpful when producers set required.acks to -1.
> However it adds a lot of latency to controlled shutdown. Also, (more
> importantly) we should have a mechanism to abort any stale shutdown process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-1342) Slow controlled shutdowns can result in stale shutdown requests

2018-07-07 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535641#comment-16535641
 ] 

James Cheng commented on KAFKA-1342:


In an older release, there was a logging “bug” in the controller where upon any 
leadership change in a single partition, it would log the state of *all 
partitions in the cluster*. That was probably the cause of the slowness you 
were seeing.

It got fixed at one point. I don’t remember which release it was fixed in, but 
I’m pretty sure that the “bug” existed in 0.10.2

> Slow controlled shutdowns can result in stale shutdown requests
> ---
>
> Key: KAFKA-1342
> URL: https://issues.apache.org/jira/browse/KAFKA-1342
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.8.1
>Reporter: Joel Koshy
>Assignee: Joel Koshy
>Priority: Critical
>  Labels: newbie++, newbiee, reliability
>
> I don't think this is a bug introduced in 0.8.1., but triggered by the fact
> that controlled shutdown seems to have become slower in 0.8.1 (will file a
> separate ticket to investigate that). When doing a rolling bounce, it is
> possible for a bounced broker to stop all its replica fetchers since the
> previous PID's shutdown requests are still being shutdown.
> - 515 is the controller
> - Controlled shutdown initiated for 503
> - Controller starts controlled shutdown for 503
> - The controlled shutdown takes a long time in moving leaders and moving
>   follower replicas on 503 to the offline state.
> - So 503's read from the shutdown channel times out and a new channel is
>   created. It issues another shutdown request.  This request (since it is a
>   new channel) is accepted at the controller's socket server but then waits
>   on the broker shutdown lock held by the previous controlled shutdown which
>   is still in progress.
> - The above step repeats for the remaining retries (six more requests).
> - 503 hits SocketTimeout exception on reading the response of the last
>   shutdown request and proceeds to do an unclean shutdown.
> - The controller's onBrokerFailure call-back fires and moves 503's replicas
>   to offline (not too important in this sequence).
> - 503 is brought back up.
> - The controller's onBrokerStartup call-back fires and moves its replicas
>   (and partitions) to online state. 503 starts its replica fetchers.
> - Unfortunately, the (phantom) shutdown requests are still being handled and
>   the controller sends StopReplica requests to 503.
> - The first shutdown request finally finishes (after 76 minutes in my case!).
> - The remaining shutdown requests also execute and do the same thing (sends
>   StopReplica requests for all partitions to
>   503).
> - The remaining requests complete quickly because they end up not having to
>   touch zookeeper paths - no leaders left on the broker and no need to
>   shrink ISR in zookeeper since it has already been done by the first
>   shutdown request.
> - So in the end-state 503 is up, but effectively idle due to the previous
>   PID's shutdown requests.
> There are some obvious fixes that can be made to controlled shutdown to help
> address the above issue. E.g., we don't really need to move follower
> partitions to Offline. We did that as an "optimization" so the broker falls
> out of ISR sooner - which is helpful when producers set required.acks to -1.
> However it adds a lot of latency to controlled shutdown. Also, (more
> importantly) we should have a mechanism to abort any stale shutdown process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-5510) Streams should commit all offsets regularly

2018-06-06 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-5510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503897#comment-16503897
 ] 

James Cheng commented on KAFKA-5510:


KIP-211 should address the issue of offsets disappearing on low-traffic 
partitions. 
[https://cwiki.apache.org/confluence/display/KAFKA/KIP-211%3A+Revise+Expiration+Semantics+of+Consumer+Group+Offsets]
 . Not sure when that is going to get into core, though.

 

 

> Streams should commit all offsets regularly
> ---
>
> Key: KAFKA-5510
> URL: https://issues.apache.org/jira/browse/KAFKA-5510
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Matthias J. Sax
>Priority: Major
>
> Currently, Streams commits only offsets of partitions it processed records 
> for. Thus, if a partition does not have any data for longer than 
> {{offsets.retention.minutes}} (default 1 day) the latest committed offset 
> gets lost. On failure or restart {{auto.offset.reset}} kicks in, potentially 
> resulting in reprocessing old data.
> Thus, Streams should commit _all_ offsets on a regular basis. Not sure what 
> the overhead of a commit is -- if it's too expensive to commit all offsets on 
> every regular commit, we could also have a second config that specifies a 
> "commit.all.interval".
> This relates to https://issues.apache.org/jira/browse/KAFKA-3806, so we 
> should sync to get a solid overall solution.
> At the same time, it might be better to change the semantics of 
> {{offsets.retention.minutes}} in the first place. It might be better to apply 
> this setting only if the consumer group is completely dead (and not on "last 
> commit" and "per partition" basis). Thus, this JIRA would be a workaround fix 
> if core cannot be changed quickly enough.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-6970) Kafka streams lets the user call init() and close() on a state store, when inside Processors

2018-05-30 Thread James Cheng (JIRA)
James Cheng created KAFKA-6970:
--

 Summary: Kafka streams lets the user call init() and close() on a 
state store, when inside Processors
 Key: KAFKA-6970
 URL: https://issues.apache.org/jira/browse/KAFKA-6970
 Project: Kafka
  Issue Type: Bug
Reporter: James Cheng


When using a state store within Transform (and Processor and TransformValues), 
the user is able to call init() and close() on the state stores. Those APIs 
should only be called by Kafka Streams itself.

If possible, it would be good to guard those APIs so that the user cannot call 
them.
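
A minimal sketch of what this looks like from user code (the store name "my-store" 
is hypothetical, and this assumes the topology has connected that store to the 
processor):
{code:java}
import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

// Hypothetical illustration: the store handle returned by the context still
// exposes init() and close(), even though only Kafka Streams should call them.
public class StoreMisuseExample extends AbstractProcessor<String, String> {

    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        super.init(context);
        store = (KeyValueStore<String, String>) context.getStateStore("my-store");

        // Nothing stops user code from doing this, although the lifecycle
        // should be managed by the framework:
        // store.close();
        // store.init(context, store);
    }

    @Override
    public void process(String key, String value) {
        store.put(key, value);
    }
}
{code}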



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6967) TopologyTestDriver does not allow pre-populating state stores that have change logging

2018-05-29 Thread James Cheng (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-6967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494299#comment-16494299
 ] 

James Cheng commented on KAFKA-6967:


If the very first call to the state store is a `store.put()`, it has no 
timestamp (or context), so the `put()` fails with the above stack trace.

If you instead run some data through topology first, and then afterwards 
manipulate the state store via `store.put()`, it has a context, and so the call 
works fine.
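
A sketch of that workaround against the 1.1.0 test-utils API (the topic and store 
names here are hypothetical):
{code:java}
import java.util.Properties;

import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.TopologyTestDriver;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.test.ConsumerRecordFactory;

public class PrePopulateWorkaroundSketch {

    // Assumes 'topology' reads "input-topic" and has a change-logged store "my-store".
    static void prePopulate(Topology topology, Properties props) {
        TopologyTestDriver driver = new TopologyTestDriver(topology, props);
        ConsumerRecordFactory<String, String> factory = new ConsumerRecordFactory<>(
                "input-topic", new StringSerializer(), new StringSerializer());

        // 1) Drive one record through the topology so the processor context
        //    has a timestamp.
        driver.pipeInput(factory.create("input-topic", "warmup-key", "warmup-value"));

        // 2) Direct puts against the change-logged store now succeed instead of
        //    hitting the IllegalStateException above.
        KeyValueStore<String, String> store = driver.getKeyValueStore("my-store");
        store.put("pre-populated-key", "pre-populated-value");

        driver.close();
    }
}
{code}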

> TopologyTestDriver does not allow pre-populating state stores that have 
> change logging
> --
>
> Key: KAFKA-6967
> URL: https://issues.apache.org/jira/browse/KAFKA-6967
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 1.1.0
>Reporter: James Cheng
>Assignee: Matthias J. Sax
>Priority: Major
>
> TopologyTestDriver does not allow pre-populating a state store that has 
> logging enabled. If you try to do it, you will get the following error 
> message:
>  
> {code:java}
> java.lang.IllegalStateException: This should not happen as timestamp() should 
> only be called while a record is processed
>   at 
> org.apache.kafka.streams.processor.internals.AbstractProcessorContext.timestamp(AbstractProcessorContext.java:153)
>   at 
> org.apache.kafka.streams.state.internals.StoreChangeLogger.logChange(StoreChangeLogger.java:59)
>   at 
> org.apache.kafka.streams.state.internals.ChangeLoggingKeyValueBytesStore.put(ChangeLoggingKeyValueBytesStore.java:69)
>   at 
> org.apache.kafka.streams.state.internals.ChangeLoggingKeyValueBytesStore.put(ChangeLoggingKeyValueBytesStore.java:29)
>   at 
> org.apache.kafka.streams.state.internals.InnerMeteredKeyValueStore.put(InnerMeteredKeyValueStore.java:198)
>   at 
> org.apache.kafka.streams.state.internals.MeteredKeyValueBytesStore.put(MeteredKeyValueBytesStore.java:117)
> {code}
> Also see:
> https://github.com/apache/kafka/blob/trunk/streams/test-utils/src/test/java/org/apache/kafka/streams/TopologyTestDriverTest.java#L723-L740



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KAFKA-6967) TopologyTestDriver does not allow pre-populating state stores that have change logging

2018-05-29 Thread James Cheng (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Cheng reassigned KAFKA-6967:
--

Assignee: Matthias J. Sax

> TopologyTestDriver does not allow pre-populating state stores that have 
> change logging
> --
>
> Key: KAFKA-6967
> URL: https://issues.apache.org/jira/browse/KAFKA-6967
> Project: Kafka
>  Issue Type: Bug
>Reporter: James Cheng
>Assignee: Matthias J. Sax
>Priority: Major
>
> TopologyTestDriver does not allow pre-populating a state store that has 
> logging enabled. If you try to do it, you will get the following error 
> message:
>  
> {code:java}
> java.lang.IllegalStateException: This should not happen as timestamp() should 
> only be called while a record is processed
>   at 
> org.apache.kafka.streams.processor.internals.AbstractProcessorContext.timestamp(AbstractProcessorContext.java:153)
>   at 
> org.apache.kafka.streams.state.internals.StoreChangeLogger.logChange(StoreChangeLogger.java:59)
>   at 
> org.apache.kafka.streams.state.internals.ChangeLoggingKeyValueBytesStore.put(ChangeLoggingKeyValueBytesStore.java:69)
>   at 
> org.apache.kafka.streams.state.internals.ChangeLoggingKeyValueBytesStore.put(ChangeLoggingKeyValueBytesStore.java:29)
>   at 
> org.apache.kafka.streams.state.internals.InnerMeteredKeyValueStore.put(InnerMeteredKeyValueStore.java:198)
>   at 
> org.apache.kafka.streams.state.internals.MeteredKeyValueBytesStore.put(MeteredKeyValueBytesStore.java:117)
> {code}
> Also see:
> https://github.com/apache/kafka/blob/trunk/streams/test-utils/src/test/java/org/apache/kafka/streams/TopologyTestDriverTest.java#L723-L740



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-6967) TopologyTestDriver does not allow pre-populating state stores that have change logging

2018-05-29 Thread James Cheng (JIRA)
James Cheng created KAFKA-6967:
--

 Summary: TopologyTestDriver does not allow pre-populating state 
stores that have change logging
 Key: KAFKA-6967
 URL: https://issues.apache.org/jira/browse/KAFKA-6967
 Project: Kafka
  Issue Type: Bug
Reporter: James Cheng


TopologyTestDriver does not allow pre-populating a state store that has logging 
enabled. If you try to do it, you will get the following error message:

 
{code:java}
java.lang.IllegalStateException: This should not happen as timestamp() should 
only be called while a record is processed
at 
org.apache.kafka.streams.processor.internals.AbstractProcessorContext.timestamp(AbstractProcessorContext.java:153)
at 
org.apache.kafka.streams.state.internals.StoreChangeLogger.logChange(StoreChangeLogger.java:59)
at 
org.apache.kafka.streams.state.internals.ChangeLoggingKeyValueBytesStore.put(ChangeLoggingKeyValueBytesStore.java:69)
at 
org.apache.kafka.streams.state.internals.ChangeLoggingKeyValueBytesStore.put(ChangeLoggingKeyValueBytesStore.java:29)
at 
org.apache.kafka.streams.state.internals.InnerMeteredKeyValueStore.put(InnerMeteredKeyValueStore.java:198)
at 
org.apache.kafka.streams.state.internals.MeteredKeyValueBytesStore.put(MeteredKeyValueBytesStore.java:117)
{code}

Also see:
https://github.com/apache/kafka/blob/trunk/streams/test-utils/src/test/java/org/apache/kafka/streams/TopologyTestDriverTest.java#L723-L740



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-3473) KIP-237: More Controller Health Metrics

2018-05-19 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481754#comment-16481754
 ] 

James Cheng commented on KAFKA-3473:


I was thinking of the [http://kafka.apache.org/documentation/#monitoring] 
section of the kafka site. It has a list of the JMX metrics that are available 
on the brokers. We can add the new metrics to that list.

I think the file to edit is 
https://github.com/apache/kafka/blob/trunk/docs/ops.html

> KIP-237: More Controller Health Metrics
> ---
>
> Key: KAFKA-3473
> URL: https://issues.apache.org/jira/browse/KAFKA-3473
> Project: Kafka
>  Issue Type: Improvement
>  Components: controller
>Affects Versions: 1.0.1
>Reporter: Jiangjie Qin
>Assignee: Dong Lin
>Priority: Major
> Fix For: 2.0.0
>
>
> Currently the controller appends the requests to brokers into the controller channel 
> manager queue during state transitions, i.e. the state transitions are 
> propagated asynchronously. We need to track the request queue time on the 
> controller side to see how long the state propagation is delayed after the 
> state transition finishes on the controller.
> We also want to have metrics to monitor the ControllerEventManager queue size 
> and the average time it takes for an event to wait in this queue before being 
> processed.
> See 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-237%3A+More+Controller+Health+Metrics
>  for more detail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-3473) KIP-237: More Controller Health Metrics

2018-05-19 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481489#comment-16481489
 ] 

James Cheng commented on KAFKA-3473:


[~lindong], can you update the docs to include this new metric?

> KIP-237: More Controller Health Metrics
> ---
>
> Key: KAFKA-3473
> URL: https://issues.apache.org/jira/browse/KAFKA-3473
> Project: Kafka
>  Issue Type: Improvement
>  Components: controller
>Affects Versions: 1.0.1
>Reporter: Jiangjie Qin
>Assignee: Dong Lin
>Priority: Major
> Fix For: 2.0.0
>
>
> Currently the controller appends the requests to brokers into the controller channel 
> manager queue during state transitions, i.e. the state transitions are 
> propagated asynchronously. We need to track the request queue time on the 
> controller side to see how long the state propagation is delayed after the 
> state transition finishes on the controller.
> We also want to have metrics to monitor the ControllerEventManager queue size 
> and the average time it takes for an event to wait in this queue before being 
> processed.
> See 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-237%3A+More+Controller+Health+Metrics
>  for more detail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6830) Add new metrics for consumer/replication fetch requests

2018-04-28 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457470#comment-16457470
 ] 

James Cheng commented on KAFKA-6830:


New metrics require a KIP. Would you be able to put one together? 
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals

> Add new metrics for consumer/replication fetch requests
> ---
>
> Key: KAFKA-6830
> URL: https://issues.apache.org/jira/browse/KAFKA-6830
> Project: Kafka
>  Issue Type: Improvement
>  Components: core
>Reporter: Adam Kotwasinski
>Priority: Major
>
> Currently, we have only one fetch request-related metric for a topic.
>  As fetch requests are used by both client consumers and replicating brokers, 
> it is impossible to tell whether a particular partition (with replication factor 
> > 1) is being actively read by client consumers.
> Rationale for this improvement: as the owner of a Kafka installation, but not the 
> owner of the clients, I want to know which topics still have active (real) 
> consumers.
> PR linked.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6746) Allow ZK Znode configurable for Kafka broker

2018-04-04 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16426344#comment-16426344
 ] 

James Cheng commented on KAFKA-6746:


The zookeeper.connect config value actually lets you specify a chroot path, but 
it appears that the documentation doesn't actually say that. At least, not in 
the right place.

Here is something from the old consumer section: 
[http://kafka.apache.org/documentation/#oldconsumerconfigs]
{noformat}
The server may also have a ZooKeeper chroot path as part of its ZooKeeper 
connection string which puts its data under some path in the global ZooKeeper 
namespace. If so the consumer should use the same chroot path in its connection 
string. For example to give a chroot path of /chroot/path you would give the 
connection string as 
hostname1:port1,hostname2:port2,hostname3:port3/chroot/path.
{noformat}
Can you test if that works? And if so, would you be willing to open a pull 
request to improve the documentation about this?
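
For reference, a configuration along these lines (hypothetical hosts and path) 
should keep all of that cluster's data under the given chroot:
{noformat}
# server.properties (hypothetical values)
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka/cluster-a
{noformat}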

> Allow ZK Znode configurable for Kafka broker 
> -
>
> Key: KAFKA-6746
> URL: https://issues.apache.org/jira/browse/KAFKA-6746
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Biju Nair
>Priority: Major
>
> By allowing users to specify the {{Znode}} to be used along with the {{ZK 
> Quorum}}, users will be able to reuse a {{ZK}} cluster for many {{Kafka}} 
> clusters. This will help in reducing the {{ZK}} cluster footprint especially 
> in non production environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-4893) async topic deletion conflicts with max topic length

2018-03-19 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405558#comment-16405558
 ] 

James Cheng commented on KAFKA-4893:


We ran into another error which may be related to this:

{code:java}
[2018-03-17 03:11:39,655] ERROR There was an error in one of the threads during 
logs loading: java.lang.IllegalArgumentException: Duplicate log directories 
found: 
/redacted/kafka/disk-xvdg/log/mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.foofoofoofoofoofoofoofoofoofoofo-3,
 
/redacted/kafka/disk-xvdh/log/mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.foofoofoofoofoofoofoofoofoofoofo-3!
 (kafka.log.LogManager)
[2018-03-17 03:11:39,664] FATAL [Kafka Server 1], Fatal error during 
KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
java.lang.IllegalArgumentException: Duplicate log directories found: 
/redacted/kafka/disk-xvdg/log/mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.foofoofoofoofoofoofoofoofoofoofo-3,
 
/redacted/kafka/disk-xvdh/log/mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.mirror.foofoofoofoofoofoofoofoofoofoofo-3!
{code}

We ran into this error upon broker startup, 2 weeks after we recovered from the 
above scenario. Our theory (unconfirmed) is that:
1) The situation in this JIRA happened, leaving the topic in a partially 
deleted state.
2) The topic was recreated, and (somehow) made it past the partially deleted 
state. The directory for the partition was created in a different log.dir.
3) Time passes.
4) Upon startup, kafka validates all log.dirs, and found the same partition 
exists in different log.dirs

We didn't have time to verify this theory, but we thought we would leave it 
here in case someone else runs into it and has time to look into it.

> async topic deletion conflicts with max topic length
> 
>
> Key: KAFKA-4893
> URL: https://issues.apache.org/jira/browse/KAFKA-4893
> Project: Kafka
>  Issue Type: Bug
>Reporter: Onur Karaman
>Assignee: Vahid Hashemian
>Priority: Minor
> Fix For: 2.0.0
>
>
> As per the 
> [documentation|http://kafka.apache.org/documentation/#basic_ops_add_topic], 
> topics can be only 249 characters long to line up with typical filesystem 
> limitations:
> {quote}
> Each sharded partition log is placed into its own folder under the Kafka log 
> directory. The name of such folders consists of the topic name, appended by a 
> dash (\-) and the partition id. Since a typical folder name can not be over 
> 255 characters long, there will be a limitation on the length of topic names. 
> We assume the number of partitions will not ever be above 100,000. Therefore, 
> topic names cannot be longer than 249 characters. This leaves just enough 
> room in the folder name for a dash and a potentially 5 digit long partition 
> id.
> {quote}
> {{kafka.common.Topic.maxNameLength}} is set to 249 and is used during 
> validation.
> This limit ends up not being quite right since topic deletion ends up 
> renaming the directory to the form {{topic-partition.uniqueId-delete}} as can 
> be seen in {{LogManager.asyncDelete}}:
> {code}
> val dirName = new StringBuilder(removedLog.name)
>   .append(".")
>   
> .append(java.util.UUID.randomUUID.toString.replaceAll("-",""))
>   .append(Log.DeleteDirSuffix)
>   .toString()
> {code}
> So the unique id and "-delete" suffix end up hogging some of the characters. 
> Deleting a long-named topic results in a log message such as the following:
> {code}
> kafka.common.KafkaStorageException: Failed to rename log directory from 
> /tmp/kafka-logs0/0-0
>  to 
> 

[jira] [Commented] (KAFKA-4893) async topic deletion conflicts with max topic length

2018-03-19 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405553#comment-16405553
 ] 

James Cheng commented on KAFKA-4893:


We just ran into this. Our mirroring setup automatically prepends "mirror." to 
any topic that it mirrors. We accidentally created a mirroring loop. So we 
created
{noformat}
foo
mirror.foo
mirror.mirror.foo
mirror.mirror.mirror.foo
...
{noformat}
And eventually hit max topic length. And then when we tried to delete the 
longest one, we ran into this. We eventually had to manually delete it by:
 1) Taking down the brokers
 2) Deleting the topic from zookeeper
 3) Deleting the log dirs from disk

> async topic deletion conflicts with max topic length
> 
>
> Key: KAFKA-4893
> URL: https://issues.apache.org/jira/browse/KAFKA-4893
> Project: Kafka
>  Issue Type: Bug
>Reporter: Onur Karaman
>Assignee: Vahid Hashemian
>Priority: Minor
> Fix For: 2.0.0
>
>
> As per the 
> [documentation|http://kafka.apache.org/documentation/#basic_ops_add_topic], 
> topics can be only 249 characters long to line up with typical filesystem 
> limitations:
> {quote}
> Each sharded partition log is placed into its own folder under the Kafka log 
> directory. The name of such folders consists of the topic name, appended by a 
> dash (\-) and the partition id. Since a typical folder name can not be over 
> 255 characters long, there will be a limitation on the length of topic names. 
> We assume the number of partitions will not ever be above 100,000. Therefore, 
> topic names cannot be longer than 249 characters. This leaves just enough 
> room in the folder name for a dash and a potentially 5 digit long partition 
> id.
> {quote}
> {{kafka.common.Topic.maxNameLength}} is set to 249 and is used during 
> validation.
> This limit ends up not being quite right since topic deletion ends up 
> renaming the directory to the form {{topic-partition.uniqueId-delete}} as can 
> be seen in {{LogManager.asyncDelete}}:
> {code}
> val dirName = new StringBuilder(removedLog.name)
>   .append(".")
>   
> .append(java.util.UUID.randomUUID.toString.replaceAll("-",""))
>   .append(Log.DeleteDirSuffix)
>   .toString()
> {code}
> So the unique id and "-delete" suffix end up hogging some of the characters. 
> Deleting a long-named topic results in a log message such as the following:
> {code}
> kafka.common.KafkaStorageException: Failed to rename log directory from 
> /tmp/kafka-logs0/0-0
>  to 
> /tmp/kafka-logs0/0-0.797bba3fb2464729840f87769243edbb-delete
>   at kafka.log.LogManager.asyncDelete(LogManager.scala:439)
>   at 
> kafka.cluster.Partition$$anonfun$delete$1.apply$mcV$sp(Partition.scala:142)
>   at kafka.cluster.Partition$$anonfun$delete$1.apply(Partition.scala:137)
>   at kafka.cluster.Partition$$anonfun$delete$1.apply(Partition.scala:137)
>   at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:213)
>   at kafka.utils.CoreUtils$.inWriteLock(CoreUtils.scala:221)
>   at kafka.cluster.Partition.delete(Partition.scala:137)
>   at kafka.server.ReplicaManager.stopReplica(ReplicaManager.scala:230)
>   at 
> kafka.server.ReplicaManager$$anonfun$stopReplicas$2.apply(ReplicaManager.scala:260)
>   at 
> kafka.server.ReplicaManager$$anonfun$stopReplicas$2.apply(ReplicaManager.scala:259)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at kafka.server.ReplicaManager.stopReplicas(ReplicaManager.scala:259)
>   at kafka.server.KafkaApis.handleStopReplicaRequest(KafkaApis.scala:174)
>   at kafka.server.KafkaApis.handle(KafkaApis.scala:86)
>   at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:64)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> The topic after this point still exists but has Leader set to -1 and the 
> controller recognizes the topic completion as incomplete (the topic znode is 
> still in /admin/delete_topics).
> I don't believe linkedin has any topic name this long but I'm making the 
> ticket in case anyone runs into this problem.



--
This message was sent by 

[jira] [Commented] (KAFKA-6264) Log cleaner thread may die on legacy segment containing messages whose offsets are too large

2018-03-14 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399698#comment-16399698
 ] 

James Cheng commented on KAFKA-6264:


We recently ran into this on one of our clusters.
{code:java}
[2018-02-21 14:24:17,638] INFO Cleaner 0: Beginning cleaning of log 
__consumer_offsets-45. (kafka.log.LogCleaner) 
[2018-02-21 14:24:17,638] INFO Cleaner 0: Building offset map for 
__consumer_offsets-45... (kafka.log.LogCleaner) 
[2018-02-21 14:24:17,796] INFO Cleaner 0: Building offset map for log 
__consumer_offsets-45 for 20 segments in offset range [0, 2469849110). 
(kafka.log.LogCleaner) 
[2018-02-21 14:24:19,143] INFO Cleaner 0: Offset map for log 
__consumer_offsets-45 complete. (kafka.log.LogCleaner) 
[2018-02-21 14:24:19,143] INFO Cleaner 0: Cleaning log __consumer_offsets-45 
(cleaning prior to Wed Feb 14 07:56:02 UTC 2018, discarding tombstones prior to 
Thu Jan 01 00:00:00 UTC 1970)... (kafka.log.LogCleaner) 
[2018-02-21 14:24:19,144] INFO Cleaner 0: Cleaning segment 0 in log 
__consumer_offsets-45 (largest timestamp Tue Feb 28 02:56:04 UTC 2017) into 0, 
retaining deletes. (kafka.log.LogCleaner) 
[2018-02-21 14:24:19,155] ERROR [kafka-log-cleaner-thread-0]: Error due to 
(kafka.log.LogCleaner) 
java.lang.IllegalArgumentException: requirement failed: largest offset in 
message set can not be safely converted to relative offset. 
at scala.Predef$.require(Predef.scala:224) 
at kafka.log.LogSegment.append(LogSegment.scala:121) 
at kafka.log.Cleaner.cleanInto(LogCleaner.scala:547) 
at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:443) 
at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:385) 
at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:384) 
at scala.collection.immutable.List.foreach(List.scala:392) 
at kafka.log.Cleaner.doClean(LogCleaner.scala:384) 
at kafka.log.Cleaner.clean(LogCleaner.scala:361) 
at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:256) 
at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:236) 
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64) 
[2018-02-21 14:24:19,169] INFO [kafka-log-cleaner-thread-0]: Stopped 
(kafka.log.LogCleaner){code}
We used DumpLogSegments to look at the difference between the first and last 
offset for each of the .log files. I didn't find any where the difference was 
greater than MAXINT (2^31-1).

However, when reading https://issues.apache.org/jira/browse/KAFKA-6264, it 
mentioned that the filename is relevant and is part of this calculation. So I 
compared the filenames vs the last offset.
{code:java}
sh-4.2# ls -l *.log 
-rw-r--r--. 1 root root 12195 Apr 27 2017 .log 
-rw-r--r--. 1 root root 236030 Oct 10 23:07 002469031807.log 
-rw-r--r--. 1 root root 1925969 Oct 18 18:46 002469285574.log 
-rw-r--r--. 1 root root 78693 Oct 25 22:26 002469290346.log 
-rw-r--r--. 1 root root 94544 Nov 1 17:29 002469290862.log 
-rw-r--r--. 1 root root 270094 Nov 8 22:51 002469291421.log 
-rw-r--r--. 1 root root 497500 Nov 16 06:24 002469292884.log 
-rw-r--r--. 1 root root 5476783 Nov 23 21:19 002469295854.log 
-rw-r--r--. 1 root root 18245427 Dec 1 08:45 002469321787.log 
-rw-r--r--. 1 root root 7527104 Dec 8 07:38 002469414611.log 
-rw-r--r--. 1 root root 6751417 Dec 15 09:36 002469453997.log 
-rw-r--r--. 1 root root 5904919 Dec 22 10:01 002469487104.log 
-rw-r--r--. 1 root root 293325 Dec 29 10:02 002469528006.log 
-rw-r--r--. 1 root root 798008 Jan 5 23:02 002469529772.log 
-rw-r--r--. 1 root root 9366028 Jan 13 10:02 002469535454.log 
-rw-r--r--. 1 root root 7357635 Jan 20 10:01 002469577027.log 
-rw-r--r--. 1 root root 1073890 Jan 27 10:05 002469615817.log 
-rw-r--r--. 1 root root 6390581 Feb 3 13:15 002469622670.log 
-rw-r--r--. 1 root root 44379090 Feb 11 04:01 002469652193.log 
-rw-r--r--. 1 root root 970717 Feb 17 16:27 002469845650.log 
-rw-r--r--. 1 root root 468160 Feb 24 10:01 002469850446.log 
-rw-r--r--. 1 root root 104857353 Feb 28 16:55 002469854492.log 
-rw-r--r--. 1 root root 104857445 Feb 28 17:22 002470671339.log 
-rw-r--r--. 1 root root 104857445 Feb 28 18:23 002471488410.log 
-rw-r--r--. 1 root root 104857445 Feb 28 19:49 002472305481.log 
-rw-r--r--. 1 root root 104857445 Feb 28 21:05 002473122552.log 
-rw-r--r--. 1 root root 104857445 Feb 28 23:13 002473939623.log 
-rw-r--r--. 1 root root 104857445 Mar 1 02:05 002474756694.log 
-rw-r--r--. 1 root root 104857445 Mar 1 02:47 002475573765.log 
-rw-r--r--. 1 root root 104857445 Mar 1 03:19 002476390836.log 
-rw-r--r--. 1 root root 104857438 Mar 1 05:33 002477207907.log 
-rw-r--r--. 1 root root 104857572 Mar 1 06:24 002478024876.log 
-rw-r--r--. 1 root root 104857445 

[jira] [Commented] (KAFKA-6588) Add a metric to monitor live log cleaner thread

2018-02-23 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375316#comment-16375316
 ] 

James Cheng commented on KAFKA-6588:


KAFKA-3857 added a metric "time-since-last-run-ms". If it's consistently a 
"small" number, then the cleaner is still working. If it grows really high, 
then the log cleaner thread is dead. PR is here: 
https://github.com/apache/kafka/pull/1593

Would that be sufficient?
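
For what it's worth, here is a sketch of how that gauge could be polled over JMX; 
the MBean name, the "Value" attribute, and the JMX port are assumptions based on 
KAFKA-3857, not something defined in this ticket:
{code:java}
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical liveness check: alert if the log cleaner has not run recently.
public class LogCleanerLivenessCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ObjectName gauge =
                    new ObjectName("kafka.log:type=LogCleaner,name=time-since-last-run-ms");
            Number msSinceLastRun = (Number) conn.getAttribute(gauge, "Value");
            // Threshold is arbitrary; pick whatever "grows really high" means for you.
            if (msSinceLastRun.longValue() > 30 * 60 * 1000L) {
                System.err.println("Log cleaner may be dead: " + msSinceLastRun
                        + " ms since last run");
            }
        }
    }
}
{code}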

 

> Add a metric to monitor live log cleaner thread
> ---
>
> Key: KAFKA-6588
> URL: https://issues.apache.org/jira/browse/KAFKA-6588
> Project: Kafka
>  Issue Type: Bug
>Reporter: Navina Ramesh
>Assignee: Navina Ramesh
>Priority: Minor
>
> We want to have a more direct metric to monitor the log cleaner thread. 
> Hence, adding a simple metric in `LogCleaner.scala`. 
> Additionally, making a minor change to make sure the correct offsets are 
> logged in `LogCleaner#recordStats` 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6505) Add simple raw "offset-commit-failures", "offset-commits" and "offset-commit-successes" count metric

2018-02-01 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348232#comment-16348232
 ] 

James Cheng commented on KAFKA-6505:


[~steff1193]: Yes, a KIP is still required even if you are only adding new 
metrics.

The reason for this is that the metrics are a "monitoring interface" that users 
(operators?) will rely on, and that needs to be supported long-term. So the KIP 
is the place where we have the discussions about use cases, naming, etc.

> Add simple raw "offset-commit-failures", "offset-commits" and 
> "offset-commit-successes" count metric
> 
>
> Key: KAFKA-6505
> URL: https://issues.apache.org/jira/browse/KAFKA-6505
> Project: Kafka
>  Issue Type: Improvement
>  Components: KafkaConnect
>Affects Versions: 1.0.0
>Reporter: Per Steffensen
>Priority: Minor
>  Labels: needs-kip
>
> MBean 
> "kafka.connect:type=connector-task-metrics,connector=,task=x" 
> has several attributes. Most of them seems to be avg/max/pct over the entire 
> lifetime of the process. They are not very useful when monitoring a system, 
> where you typically want to see when there have been problems and if there 
> are problems right now.
> E.g. I would like to expose to an administrator when offset-commits have been 
> failing (e.g. timing out) including if they are failing right now. It is 
> really hard to do that properly, just using attribute 
> "offset-commit-failure-percentage". You can expose a number telling how much 
> the percentage has changed between two consecutive polls of the metric - if 
> it changed to the positive side, we saw offset-commit failures, and if it 
> changed to the negative side (or is stable at 0) we saw offset-commit success 
> - at least as long as the system has not been running for so long that a 
> single failing offset-commit does not even change the percentage. But it is 
> really odd, to do it this way.
> *I would like to just see an attribute "offset-commit-failures" just counting 
> how many offset-commits have failed, as an ever-increasing number. Maybe also 
> attributes "offset-commits" and "offset-commit-successes". Then I can do a 
> delta between the two last metric-polls to show how many 
> offset-commit-attempts have failed "very recently". Let this ticket be about 
> that particular added attribute (or the three added attributes).*
> Just a note on metrics IMHO (should probably be posted somewhere else):
> In general consider getting rid of stuff like avg, max, pct over the entire 
> lifetime of the process - current state is what interests people, especially 
> when it comes to failure-related metrics (failure-pct over the lifetime of 
> the process is not very useful). And people will continuously be polling and 
> storing the metrics, so we will have a history of "current state" somewhere 
> else (e.g. in Prometheus). Just give us the raw counts. Modern monitoring 
> tools can do all the avg, max, pct for you based on a time-series of 
> metrics-poll-results - and they can do it for periods of your choice (e.g. 
> average over the last minute or 5 minutes) - have a look at Prometheus PromQL 
> (e.g. used through Grafana). Just expose the raw number and let the 
> average/max/min/pct calculation be done on the collect/presentation side. 
> Only do "advanced" stuff for cases that are very interesting and where it 
> cannot be done based on simple raw number (e.g. percentiles), and consider 
> whether doing it for fairly short intervals is better than for the entire 
> lifetime of the process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6166) Streams configuration requires consumer. and producer. in order to be read

2018-01-30 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16345577#comment-16345577
 ] 

James Cheng commented on KAFKA-6166:


This PR didn't have any docs updates. Should it have, or is the new behavior 
intuitive enough? Or was this considered a regression, and the original 
behavior was already documented?

> Streams configuration requires consumer. and producer. in order to be read
> --
>
> Key: KAFKA-6166
> URL: https://issues.apache.org/jira/browse/KAFKA-6166
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.11.0.0
> Environment: Kafka 0.11.0.0
> JDK 1.8
> CoreOS
>Reporter: Justin Manchester
>Assignee: Filipe Agapito
>Priority: Minor
>  Labels: newbie++, user-experience
>
> Problem:
> In previous release you could specify a custom metrics reporter like so:
> Properties config = new Properties(); 
> config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-broker1:9092"); 
> config.put(StreamsConfig.METRIC_REPORTER_CLASSES_CONFIG, 
> "com.mycompany.MetricReporter"); 
> config.put("custom-key-for-metric-reporter", "value");
> From 0.11.0.0 onwards this is no longer possible, as you have to specify 
> consumer.custom-key-for-metric-reporter or 
> producer.custom-key-for-metric-reporter otherwise it's stripped out of the 
> configuration.
> So, if you wish to use a metrics reporter and to collect producer and 
> consumer metrics, as well as kafka-streams metrics, that you would need to 
> specify 3 distinct configs:
> 1) consumer.custom-key-for-metric-reporter 
> 2) producer.custom-key-for-metric-reporter 
> 3) custom-key-for-metric-reporter
> This appears to be a regression.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6469) ISR change notification queue can prevent controller from making progress

2018-01-22 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335399#comment-16335399
 ] 

James Cheng commented on KAFKA-6469:


In addition to this bug, I'm kind of curious how you managed to run a cluster 
with 60k partitions per broker. We run 17k per broker, and often run into 
stability problems.

> ISR change notification queue can prevent controller from making progress
> -
>
> Key: KAFKA-6469
> URL: https://issues.apache.org/jira/browse/KAFKA-6469
> Project: Kafka
>  Issue Type: Bug
>Reporter: Kyle Ambroff-Kao
>Assignee: Kyle Ambroff-Kao
>Priority: Major
>
> When the writes to /isr_change_notification in ZooKeeper (which is effectively a 
> queue of ISR change events for the controller) happen at a rate high enough 
> that the node with a watch can't dequeue them, the trouble starts.
> The watcher kafka.controller.IsrChangeNotificationListener is fired in the 
> controller when a new entry is written to /isr_change_notification, and the 
> zkclient library sends a GetChildrenRequest to zookeeper to fetch all child 
> znodes.
> We've seen failures in one of our test clusters as the partition count started to 
> climb north of 60k per broker. We had brokers writing child nodes under 
> /isr_change_notification that were larger than the jute.maxbuffer size in 
> ZooKeeper (1MB), causing the ZooKeeper server to drop the controller's 
> session, effectively bricking the cluster.
> This can be partially mitigated by chunking ISR notifications to increase the 
> maximum number of partitions a broker can host.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-3741) Allow setting of default topic configs via StreamsConfig

2018-01-19 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331937#comment-16331937
 ] 

James Cheng commented on KAFKA-3741:


Some of these docs seem to have disappeared from top of trunk, or even 1.0.0. 
In particular, the stuff about how to use the TOPIC_PREFIX stuff (see 
[https://github.com/dguy/kafka/blob/9cd0e0fbb5c972151ba0081e8f1c366db5bd12e8/docs/streams/developer-guide.html#L498-L507)]

Does anyone know where those docs have gone to?

> Allow setting of default topic configs via StreamsConfig
> 
>
> Key: KAFKA-3741
> URL: https://issues.apache.org/jira/browse/KAFKA-3741
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 0.10.0.0
>Reporter: Roger Hoover
>Assignee: Damian Guy
>Priority: Major
>  Labels: api
> Fix For: 1.0.0
>
>
> Kafka Streams currently allows you to specify a replication factor for 
> changelog and repartition topics that it creates.  It should also allow you 
> to specify any other TopicConfig. These should be used as defaults when 
> creating Internal topics. The defaults should be overridden by any configs 
> provided by the StateStoreSuppliers etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-6422) When enable trace level log in mirror maker, it will throw null pointer exception and the mirror maker will shutdown

2018-01-08 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16316749#comment-16316749
 ] 

James Cheng commented on KAFKA-6422:


I have no ability to merge PRs, but the change looks good to me.
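
For context, the failure mode described is the usual null-unsafe logging pattern; 
a minimal sketch of it and of the fix (illustrative only, not the actual 
MirrorMaker code):
{code:java}
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TraceLoggingSketch {

    private static final Logger log = LoggerFactory.getLogger(TraceLoggingSketch.class);

    static void logRecord(ConsumerRecord<byte[], byte[]> record) {
        // Null-unsafe version: throws NullPointerException for records whose
        // value is null (tombstones) once trace logging is enabled.
        // log.trace("Sending record with value size {}", record.value().length);

        // Null-safe version:
        int size = record.value() == null ? 0 : record.value().length;
        log.trace("Sending record with value size {}", size);
    }
}
{code}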

> When enable trace level log in mirror maker, it will throw null pointer 
> exception and the mirror maker will shutdown
> 
>
> Key: KAFKA-6422
> URL: https://issues.apache.org/jira/browse/KAFKA-6422
> Project: Kafka
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.10.0.0, 0.10.1.0, 0.10.2.0, 0.11.0.0, 0.11.0.1, 
> 0.11.0.2
>Reporter: Xin Li
>Assignee: Xin Li
>Priority: Minor
>  Labels: easyfix
> Fix For: 0.11.0.0
>
>
> https://github.com/apache/kafka/blob/0.10.0/core/src/main/scala/kafka/tools/MirrorMaker.scala#L414
> When trace-level logging is enabled in MirrorMaker, if the message value is 
> null, it will throw a NullPointerException, and MirrorMaker will shut down 
> because of that.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6422) When enable trace level log in mirror maker, it will throw null pointer exception and the mirror maker will shutdown

2018-01-06 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16315086#comment-16315086
 ] 

James Cheng commented on KAFKA-6422:


[~Xin Li], you should assign this JIRA to yourself (Edit -> Assignee). I tried 
assigning it to you, but I couldn't find you in the dropdown.

If you aren't able to assign JIRAs to yourself, then you can send email to the 
d...@kafka.apache.org mailing list to ask for permissions to do so.

> When enable trace level log in mirror maker, it will throw null pointer 
> exception and the mirror maker will shutdown
> 
>
> Key: KAFKA-6422
> URL: https://issues.apache.org/jira/browse/KAFKA-6422
> Project: Kafka
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.10.0.0, 0.10.1.0, 0.10.2.0, 0.11.0.0, 0.11.0.1, 
> 0.11.0.2
>Reporter: Xin Li
>Priority: Minor
>
> https://github.com/apache/kafka/blob/0.10.0/core/src/main/scala/kafka/tools/MirrorMaker.scala#L414
> When trace-level logging is enabled in MirrorMaker, if the message value is 
> null, it will throw a NullPointerException, and MirrorMaker will shut down 
> because of that.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (KAFKA-3955) Kafka log recovery doesn't truncate logs on non-monotonic offsets, leading to failed broker boot

2017-12-15 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293291#comment-16293291
 ] 

James Cheng edited comment on KAFKA-3955 at 12/15/17 9:45 PM:
--

[~ijuma], I said in my comment above that:
{quote}
Note that it is only safe to do this if the partition is online (being lead by 
someone other than the affected broker). If the partition is offline and the 
preferred replica for it is the one with this problem, then deleting the log 
directory on the leader will cause 
https://issues.apache.org/jira/browse/KAFKA-3410 to happen (because the leader 
will now have no data, but followers do have data, and so follower will be 
ahead of leader).

{quote}

If this happens on the offline partition leader, then what are my options? I 
think that I will need to first, delete all the logs for this partition from 
the offline partition leader's log.dir, and then I think I will need to turn on 
unclean.leader.election for that topic (in order to prevent KAFKA-3410)

Does that sound right?


was (Author: wushujames):
[~ijuma], I said in my comment that:
{quote}
Note that it is only safe to do this if the partition is online (being led by 
someone other than the affected broker). If the partition is offline and the 
preferred replica for it is the one with this problem, then deleting the log 
directory on the leader will cause 
https://issues.apache.org/jira/browse/KAFKA-3410 to happen (because the leader 
will now have no data, but followers do have data, and so follower will be 
ahead of leader).

{quote}

If this happens on the offline partition leader, then what are my options? I 
think that I will need to first, delete all the logs for this partition from 
the offline partition leader's log.dir, and then I think I will need to turn on 
unclean.leader.election for that topic (in order to prevent KAFKA-3410)

Does that sound right?

> Kafka log recovery doesn't truncate logs on non-monotonic offsets, leading to 
> failed broker boot
> 
>
> Key: KAFKA-3955
> URL: https://issues.apache.org/jira/browse/KAFKA-3955
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.0.0, 0.10.1.1, 0.11.0.0, 1.0.0
>Reporter: Tom Crayford
>Assignee: Ismael Juma
>Priority: Critical
>  Labels: reliability
> Fix For: 1.1.0
>
>
> Hi,
> I've found a bug impacting kafka brokers on startup after an unclean 
> shutdown. If a log segment is corrupt and has non-monotonic offsets (see the 
> appendix of this bug for a sample output from {{DumpLogSegments}}), then 
> {{LogSegment.recover}} throws an {{InvalidOffsetException}} error: 
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/OffsetIndex.scala#L218
> That code is called by {{LogSegment.recover}}: 
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/LogSegment.scala#L191
> Which is called in several places in {{Log.scala}}. Notably it's called four 
> times during recovery:
> Thrice in Log.loadSegments
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/Log.scala#L199
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/Log.scala#L204
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/Log.scala#L226
> and once in Log.recoverLog
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/Log.scala#L268
> Of these, only the very last one has a {{catch}} for 
> {{InvalidOffsetException}}. When that catches the issue, it truncates the 
> whole log (not just this segment): 
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/Log.scala#L274
>  to the start segment of the bad log segment.
> However, this code can't be hit on recovery, because of the code paths in 
> {{loadSegments}} - they mean we'll never hit truncation here, as we always 
> throw this exception and that goes all the way to the toplevel exception 
> handler and crashes the JVM.
> As {{Log.recoverLog}} is always called during recovery, I *think* a fix for 
> this is to move this crash recovery/truncate code inside a new method in 
> {{Log.scala}}, and call that instead of {{LogSegment.recover}} in each place. 
> That code should return the number of {{truncatedBytes}} like we do in 
> {{Log.recoverLog}} and then truncate the log. The callers will have to be 
> notified "stop iterating over files in the directory", likely via a return 
> value of {{truncatedBytes}} like {{Log.recoverLog}} does right now.
> I'm happy working on a patch for this. I'm aware this recovery code is tricky 
> and important to get right.
> I'm also curious (and currently don't have good theories as of yet) as to how 
> this log segment got into this state with 

[jira] [Commented] (KAFKA-3410) Unclean leader election and "Halting because log truncation is not allowed"

2017-12-15 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293294#comment-16293294
 ] 

James Cheng commented on KAFKA-3410:


Do we think KAFKA-1211 resolves this? If so, should we set Fix Version and 
Resolution on this JIRA?

> Unclean leader election and "Halting because log truncation is not allowed"
> ---
>
> Key: KAFKA-3410
> URL: https://issues.apache.org/jira/browse/KAFKA-3410
> Project: Kafka
>  Issue Type: Bug
>Reporter: James Cheng
>  Labels: reliability
>
> I ran into a scenario where one of my brokers would continually shutdown, 
> with the error message:
> [2016-02-25 00:29:39,236] FATAL [ReplicaFetcherThread-0-1], Halting because 
> log truncation is not allowed for topic test, Current leader 1's latest 
> offset 0 is less than replica 2's latest offset 151 
> (kafka.server.ReplicaFetcherThread)
> I managed to reproduce it with the following scenario:
> 1. Start broker1, with unclean.leader.election.enable=false
> 2. Start broker2, with unclean.leader.election.enable=false
> 3. Create topic, single partition, with replication-factor 2.
> 4. Write data to the topic.
> 5. At this point, both brokers are in the ISR. Broker1 is the partition 
> leader.
> 6. Ctrl-Z on broker2. (Simulates a GC pause or a slow network) Broker2 gets 
> dropped out of ISR. Broker1 is still the leader. I can still write data to 
> the partition.
> 7. Shutdown Broker1. Hard or controlled, doesn't matter.
> 8. rm -rf the log directory of broker1. (This simulates a disk replacement or 
> full hardware replacement)
> 9. Resume broker2. It attempts to connect to broker1, but doesn't succeed 
> because broker1 is down. At this point, the partition is offline. Can't write 
> to it.
> 10. Resume broker1. Broker1 resumes leadership of the topic. Broker2 attempts 
> to join ISR, and immediately halts with the error message:
> [2016-02-25 00:29:39,236] FATAL [ReplicaFetcherThread-0-1], Halting because 
> log truncation is not allowed for topic test, Current leader 1's latest 
> offset 0 is less than replica 2's latest offset 151 
> (kafka.server.ReplicaFetcherThread)
> I am able to recover by setting unclean.leader.election.enable=true on my 
> brokers.
> I'm trying to understand a couple things:
> * In step 10, why is broker1 allowed to resume leadership even though it has 
> no data?
> * In step 10, why is it necessary to stop the entire broker due to one 
> partition that is in this state? Wouldn't it be possible for the broker to 
> continue to serve traffic for all the other topics, and just mark this one as 
> unavailable?
> * Would it make sense to allow an operator to manually specify which broker 
> they want to become the new master? This would give me more control over how 
> much data loss I am willing to handle. In this case, I would want broker2 to 
> become the new master. Or, is that possible and I just don't know how to do 
> it?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-3955) Kafka log recovery doesn't truncate logs on non-monotonic offsets, leading to failed broker boot

2017-12-15 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293291#comment-16293291
 ] 

James Cheng commented on KAFKA-3955:


[~ijuma], I said in my comment that:
{quote}
Note that it is only safe to do this if the partition is online (being led by 
someone other than the affected broker). If the partition is offline and the 
preferred replica for it is the one with this problem, then deleting the log 
directory on the leader will cause 
https://issues.apache.org/jira/browse/KAFKA-3410 to happen (because the leader 
will now have no data, but followers do have data, and so follower will be 
ahead of leader).

{quote}

If this happens on the offline partition leader, then what are my options? I 
think that I will need to first, delete all the logs for this partition from 
the offline partition leader's log.dir, and then I think I will need to turn on 
unclean.leader.election for that topic (in order to prevent KAFKA-3410)

Does that sound right?
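
For the unclean.leader.election step, here is a minimal sketch of setting the 
topic-level override programmatically. It assumes a 0.11.0+ kafka-clients jar; the 
bootstrap server and topic name are placeholders, and kafka-configs.sh can achieve 
the same thing. Note that the legacy alterConfigs call replaces the topic's whole 
dynamic config set with the entries you pass, so include any other overrides the 
topic already has.

{code}
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class EnableUncleanElectionForTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic"); // placeholder
            Config override = new Config(Collections.singleton(
                    new ConfigEntry("unclean.leader.election.enable", "true")));
            // Legacy alterConfigs: the topic's dynamic configs become exactly this set.
            admin.alterConfigs(Collections.singletonMap(topic, override)).all().get();
        }
    }
}
{code}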

> Kafka log recovery doesn't truncate logs on non-monotonic offsets, leading to 
> failed broker boot
> 
>
> Key: KAFKA-3955
> URL: https://issues.apache.org/jira/browse/KAFKA-3955
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.0.0, 0.10.1.1, 0.11.0.0, 1.0.0
>Reporter: Tom Crayford
>Assignee: Ismael Juma
>Priority: Critical
>  Labels: reliability
> Fix For: 1.1.0
>
>
> Hi,
> I've found a bug impacting kafka brokers on startup after an unclean 
> shutdown. If a log segment is corrupt and has non-monotonic offsets (see the 
> appendix of this bug for a sample output from {{DumpLogSegments}}), then 
> {{LogSegment.recover}} throws an {{InvalidOffsetException}} error: 
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/OffsetIndex.scala#L218
> That code is called by {{LogSegment.recover}}: 
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/LogSegment.scala#L191
> Which is called in several places in {{Log.scala}}. Notably it's called four 
> times during recovery:
> Thrice in Log.loadSegments
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/Log.scala#L199
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/Log.scala#L204
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/Log.scala#L226
> and once in Log.recoverLog
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/Log.scala#L268
> Of these, only the very last one has a {{catch}} for 
> {{InvalidOffsetException}}. When that catches the issue, it truncates the 
> whole log (not just this segment): 
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/Log.scala#L274
>  to the start segment of the bad log segment.
> However, this code can't be hit on recovery, because of the code paths in 
> {{loadSegments}} - they mean we'll never hit truncation here, as we always 
> throw this exception and that goes all the way to the toplevel exception 
> handler and crashes the JVM.
> As {{Log.recoverLog}} is always called during recovery, I *think* a fix for 
> this is to move this crash recovery/truncate code inside a new method in 
> {{Log.scala}}, and call that instead of {{LogSegment.recover}} in each place. 
> That code should return the number of {{truncatedBytes}} like we do in 
> {{Log.recoverLog}} and then truncate the log. The callers will have to be 
> notified "stop iterating over files in the directory", likely via a return 
> value of {{truncatedBytes}} like {{Log.recoverLog}} does right now.
> I'm happy working on a patch for this. I'm aware this recovery code is tricky 
> and important to get right.
> I'm also curious (and currently don't have good theories as of yet) as to how 
> this log segment got into this state with non-monotonic offsets. This segment 
> is using gzip compression, and is under 0.9.0.1. The same bug with respect to 
> recovery exists in trunk, but I'm unsure if the new handling around 
> compressed messages (KIP-31) means the bug where non-monotonic offsets get 
> appended is still present in trunk.
> As a production workaround, one can manually truncate that log folder 
> yourself (delete all .index/.log files including and after the one with the 
> bad offset). However, kafka should (and can) handle this case well - with 
> replication we can truncate in broker startup.
> stacktrace and error message:
> {code}
> pri=WARN  t=pool-3-thread-4 at=Log Found a corrupted index file, 
> /$DIRECTORY/$TOPIC-22/14306536.index, deleting and rebuilding 
> index...
> pri=ERROR t=main at=LogManager There was an error in one of the threads 
> during logs loading: kafka.common.InvalidOffsetException: Attempt 

[jira] [Commented] (KAFKA-3955) Kafka log recovery doesn't truncate logs on non-monotonic offsets, leading to failed broker boot

2017-12-10 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285148#comment-16285148
 ] 

James Cheng commented on KAFKA-3955:


log.message.format.version = 0.10.0.1

> Kafka log recovery doesn't truncate logs on non-monotonic offsets, leading to 
> failed broker boot
> 
>
> Key: KAFKA-3955
> URL: https://issues.apache.org/jira/browse/KAFKA-3955
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.0.0, 0.10.1.1, 0.11.0.0, 1.0.0
>Reporter: Tom Crayford
>Priority: Critical
>  Labels: reliability
> Fix For: 1.0.1
>
>
> Hi,
> I've found a bug impacting kafka brokers on startup after an unclean 
> shutdown. If a log segment is corrupt and has non-monotonic offsets (see the 
> appendix of this bug for a sample output from {{DumpLogSegments}}), then 
> {{LogSegment.recover}} throws an {{InvalidOffsetException}} error: 
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/OffsetIndex.scala#L218
> That code is called by {{LogSegment.recover}}: 
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/LogSegment.scala#L191
> Which is called in several places in {{Log.scala}}. Notably it's called four 
> times during recovery:
> Thrice in Log.loadSegments
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/Log.scala#L199
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/Log.scala#L204
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/Log.scala#L226
> and once in Log.recoverLog
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/Log.scala#L268
> Of these, only the very last one has a {{catch}} for 
> {{InvalidOffsetException}}. When that catches the issue, it truncates the 
> whole log (not just this segment): 
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/Log.scala#L274
>  to the start segment of the bad log segment.
> However, this code can't be hit on recovery, because of the code paths in 
> {{loadSegments}} - they mean we'll never hit truncation here, as we always 
> throw this exception and that goes all the way to the toplevel exception 
> handler and crashes the JVM.
> As {{Log.recoverLog}} is always called during recovery, I *think* a fix for 
> this is to move this crash recovery/truncate code inside a new method in 
> {{Log.scala}}, and call that instead of {{LogSegment.recover}} in each place. 
> That code should return the number of {{truncatedBytes}} like we do in 
> {{Log.recoverLog}} and then truncate the log. The callers will have to be 
> notified "stop iterating over files in the directory", likely via a return 
> value of {{truncatedBytes}} like {{Log.recoverLog}} does right now.
> I'm happy working on a patch for this. I'm aware this recovery code is tricky 
> and important to get right.
> I'm also curious (and currently don't have good theories as of yet) as to how 
> this log segment got into this state with non-monotonic offsets. This segment 
> is using gzip compression, and is under 0.9.0.1. The same bug with respect to 
> recovery exists in trunk, but I'm unsure if the new handling around 
> compressed messages (KIP-31) means the bug where non-monotonic offsets get 
> appended is still present in trunk.
> As a production workaround, one can manually truncate that log folder 
> yourself (delete all .index/.log files including and after the one with the 
> bad offset). However, kafka should (and can) handle this case well - with 
> replication we can truncate in broker startup.
> stacktrace and error message:
> {code}
> pri=WARN  t=pool-3-thread-4 at=Log Found a corrupted index file, 
> /$DIRECTORY/$TOPIC-22/14306536.index, deleting and rebuilding 
> index...
> pri=ERROR t=main at=LogManager There was an error in one of the threads 
> during logs loading: kafka.common.InvalidOffsetException: Attempt to append 
> an offset (15000337) to position 111719 no larger than the last offset 
> appended (15000337) to /$DIRECTORY/$TOPIC-8/14008931.index.
> pri=FATAL t=main at=KafkaServer Fatal error during KafkaServer startup. 
> Prepare to shutdown kafka.common.InvalidOffsetException: Attempt to append an 
> offset (15000337) to position 111719 no larger than the last offset appended 
> (15000337) to /$DIRECTORY/$TOPIC-8/14008931.index.
> at 
> kafka.log.OffsetIndex$$anonfun$append$1.apply$mcV$sp(OffsetIndex.scala:207)
> at 
> kafka.log.OffsetIndex$$anonfun$append$1.apply(OffsetIndex.scala:197)
> at 
> kafka.log.OffsetIndex$$anonfun$append$1.apply(OffsetIndex.scala:197)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:262)
> at 

[jira] [Commented] (KAFKA-3955) Kafka log recovery doesn't truncate logs on non-monotonic offsets, leading to failed broker boot

2017-12-10 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285136#comment-16285136
 ] 

James Cheng commented on KAFKA-3955:


We encountered this last week on 0.11.0.1. Thanks to [~tcrayford-heroku]'s 
detailed bug report, we were able to understand and recover from the issue by 
manually deleting the partition folder that contains the .log/.index files.

Note that it is only safe to do this if the partition is online (being led by 
someone other than the affected broker). If the partition is offline and the 
preferred replica for it is the one with this problem, then deleting the log 
directory on the leader will cause 
https://issues.apache.org/jira/browse/KAFKA-3410 to happen (because the leader 
will now have no data, but followers do have data, and so follower will be 
ahead of leader).


> Kafka log recovery doesn't truncate logs on non-monotonic offsets, leading to 
> failed broker boot
> 
>
> Key: KAFKA-3955
> URL: https://issues.apache.org/jira/browse/KAFKA-3955
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.0.0, 0.10.1.1, 0.11.0.0, 1.0.0
>Reporter: Tom Crayford
>Priority: Critical
>  Labels: reliability
> Fix For: 1.0.1
>
>
> Hi,
> I've found a bug impacting kafka brokers on startup after an unclean 
> shutdown. If a log segment is corrupt and has non-monotonic offsets (see the 
> appendix of this bug for a sample output from {{DumpLogSegments}}), then 
> {{LogSegment.recover}} throws an {{InvalidOffsetException}} error: 
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/OffsetIndex.scala#L218
> That code is called by {{LogSegment.recover}}: 
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/LogSegment.scala#L191
> Which is called in several places in {{Log.scala}}. Notably it's called four 
> times during recovery:
> Thrice in Log.loadSegments
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/Log.scala#L199
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/Log.scala#L204
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/Log.scala#L226
> and once in Log.recoverLog
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/Log.scala#L268
> Of these, only the very last one has a {{catch}} for 
> {{InvalidOffsetException}}. When that catches the issue, it truncates the 
> whole log (not just this segment): 
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/Log.scala#L274
>  to the start segment of the bad log segment.
> However, this code can't be hit on recovery, because of the code paths in 
> {{loadSegments}} - they mean we'll never hit truncation here, as we always 
> throw this exception and that goes all the way to the toplevel exception 
> handler and crashes the JVM.
> As {{Log.recoverLog}} is always called during recovery, I *think* a fix for 
> this is to move this crash recovery/truncate code inside a new method in 
> {{Log.scala}}, and call that instead of {{LogSegment.recover}} in each place. 
> That code should return the number of {{truncatedBytes}} like we do in 
> {{Log.recoverLog}} and then truncate the log. The callers will have to be 
> notified "stop iterating over files in the directory", likely via a return 
> value of {{truncatedBytes}} like {{Log.recoverLog}} does right now.
> I'm happy working on a patch for this. I'm aware this recovery code is tricky 
> and important to get right.
> I'm also curious (and currently don't have good theories as of yet) as to how 
> this log segment got into this state with non-monotonic offsets. This segment 
> is using gzip compression, and is under 0.9.0.1. The same bug with respect to 
> recovery exists in trunk, but I'm unsure if the new handling around 
> compressed messages (KIP-31) means the bug where non-monotonic offsets get 
> appended is still present in trunk.
> As a production workaround, one can manually truncate that log folder 
> yourself (delete all .index/.log files including and after the one with the 
> bad offset). However, kafka should (and can) handle this case well - with 
> replication we can truncate in broker startup.
> stacktrace and error message:
> {code}
> pri=WARN  t=pool-3-thread-4 at=Log Found a corrupted index file, 
> /$DIRECTORY/$TOPIC-22/14306536.index, deleting and rebuilding 
> index...
> pri=ERROR t=main at=LogManager There was an error in one of the threads 
> during logs loading: kafka.common.InvalidOffsetException: Attempt to append 
> an offset (15000337) to position 111719 no larger than the last offset 
> appended (15000337) to /$DIRECTORY/$TOPIC-8/14008931.index.
> pri=FATAL t=main 

[jira] [Commented] (KAFKA-6314) Add a tool to delete consumer offsets for a given group

2017-12-08 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283235#comment-16283235
 ] 

James Cheng commented on KAFKA-6314:


FYI, kafka-consumer-groups.sh already has --delete support, but it only works 
for zookeeper-based offsets. 
{noformat}
$ ~/kafka_2.11-1.0.0/bin/kafka-consumer-groups.sh
List all consumer groups, describe a consumer group, delete consumer group 
info, or reset consumer group offsets.
Option  Description
--  ---
--deletePass in groups to delete topic
  partition offsets and ownership
  information over the entire consumer
  group. For instance --group g1 --
  group g2
Pass in groups with a single topic to
  just delete the given topic's
  partition offsets and ownership
  information for the given consumer
  groups. For instance --group g1 --
  group g2 --topic t1
Pass in just a topic to delete the
  given topic's partition offsets and
  ownership information for every
  consumer group. For instance --topic
  t1
WARNING: Group deletion only works for
  old ZK-based consumer groups, and
  one has to use it carefully to only
  delete groups that are not active.
{noformat}

So this JIRA should say that the RFE is to let us delete kafka-based offsets.
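
For what it's worth, newer Java clients do expose this through the AdminClient: 
deleteConsumerGroups() removes an entire (inactive) group, and 
deleteConsumerGroupOffsets() (added in a later client release than the ones 
discussed here) removes the committed offsets for specific partitions. A minimal 
sketch, assuming a recent kafka-clients jar and placeholder group/topic names:

{code}
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.TopicPartition;

public class DeleteStaleGroupOffsets {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Drop the offsets that "group1" still holds for topic1 partition 0 so the
            // stale topic stops showing up (and reporting lag) in --describe output.
            // The group must not be actively consuming these partitions.
            admin.deleteConsumerGroupOffsets("group1",
                    Collections.singleton(new TopicPartition("topic1", 0))).all().get();
        }
    }
}
{code}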

> Add a tool to delete consumer offsets for a given group
> ---
>
> Key: KAFKA-6314
> URL: https://issues.apache.org/jira/browse/KAFKA-6314
> Project: Kafka
>  Issue Type: New Feature
>  Components: consumer, core, tools
>Reporter: Tom Scott
>Priority: Minor
>
> Add a tool to delete consumer offsets for a given group similar to the reset 
> tool. It could look something like this:
> kafka-consumer-groups --bootstrap-server localhost:9092 --delete-offsets 
> --group somegroup
> The case for this is as follows:
> 1. Consumer group with id: group1 subscribes to topic1
> 2. The group is stopped 
> 3. The subscription changed to topic2 but the id is kept as group1
> Now the output of kafka-consumer-groups --describe for the group will 
> show topic1 even though the group is not subscribed to that topic. This is 
> bad for monitoring as it will show lag on topic1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6312) Add documentation about kafka-consumer-groups.sh's ability to set/change offsets

2017-12-05 Thread James Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Cheng updated KAFKA-6312:
---
Labels: newbie  (was: )

> Add documentation about kafka-consumer-groups.sh's ability to set/change 
> offsets
> 
>
> Key: KAFKA-6312
> URL: https://issues.apache.org/jira/browse/KAFKA-6312
> Project: Kafka
>  Issue Type: Improvement
>  Components: documentation
>Reporter: James Cheng
>  Labels: newbie
>
> KIP-122 added the ability for kafka-consumer-groups.sh to reset/change 
> consumer offsets, at a fine grained level.
> There is documentation on it in the kafka-consumer-groups.sh usage text. 
> There is no such documentation on the kafka.apache.org website. We should add 
> some documentation to the website, so that users can read about the 
> functionality without having the tools installed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-6312) Add documentation about kafka-consumer-groups.sh's ability to set/change offsets

2017-12-05 Thread James Cheng (JIRA)
James Cheng created KAFKA-6312:
--

 Summary: Add documentation about kafka-consumer-groups.sh's 
ability to set/change offsets
 Key: KAFKA-6312
 URL: https://issues.apache.org/jira/browse/KAFKA-6312
 Project: Kafka
  Issue Type: Improvement
  Components: documentation
Reporter: James Cheng


KIP-122 added the ability for kafka-consumer-groups.sh to reset/change consumer 
offsets, at a fine grained level.

There is documentation on it in the kafka-consumer-groups.sh usage text. 

There is no such documentation on the kafka.apache.org website. We should add 
some documentation to the website, so that users can read about the 
functionality without having the tools installed.





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-3806) Adjust default values of log.retention.hours and offsets.retention.minutes

2017-11-15 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16253663#comment-16253663
 ] 

James Cheng commented on KAFKA-3806:


[~jcrowley], you should check out 
https://issues.apache.org/jira/browse/KAFKA-4682 and associated KIP 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-211%3A+Revise+Expiration+Semantics+of+Consumer+Group+Offsets

On your question about doing this on a per-group-id basis: I think that when 
your consumer commits offsets, you *might* be able to specify your own 
expiration period. I'm not sure, though. 

> Adjust default values of log.retention.hours and offsets.retention.minutes
> --
>
> Key: KAFKA-3806
> URL: https://issues.apache.org/jira/browse/KAFKA-3806
> Project: Kafka
>  Issue Type: Improvement
>  Components: config
>Affects Versions: 0.9.0.1, 0.10.0.0
>Reporter: Michal Turek
>Priority: Minor
> Fix For: 1.1.0
>
>
> Combination of default values of log.retention.hours (168 hours = 7 days) and 
> offsets.retention.minutes (1440 minutes = 1 day) may be dangerous in special 
> cases. Offset retention should be always greater than log retention.
> We have observed the following scenario and issue:
> - Producing of data to a topic was disabled two days ago by producer update, 
> topic wasn't deleted.
> - Consumer consumed all data and properly committed offsets to Kafka.
> - Consumer made no more offset commits for that topic because there was no 
> more incoming data and there was nothing to confirm. (We have auto-commit 
> disabled; I'm not sure how it behaves with auto-commit enabled.)
> - After one day: Kafka cleared too old offsets according to 
> offsets.retention.minutes.
> - After two days: Long-term running consumer was restarted after update, it 
> didn't find any committed offsets for that topic since they were deleted by 
> offsets.retention.minutes so it started consuming from the beginning.
> - The messages were still in Kafka due to larger log.retention.hours, about 5 
> days of messages were read again.
> Known workaround to solve this issue:
> - Explicitly configure log.retention.hours and offsets.retention.minutes, 
> don't use defaults.
> Proposals:
> - Prolong default value of offsets.retention.minutes to be at least twice 
> larger than log.retention.hours.
> - Check these values during Kafka startup and log a warning if 
> offsets.retention.minutes is smaller than log.retention.hours.
> - Add a note to migration guide about differences between storing of offsets 
> in ZooKeeper and Kafka (http://kafka.apache.org/documentation.html#upgrade).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-6197) Difficult to get to the Kafka Streams javadocs

2017-11-09 Thread James Cheng (JIRA)
James Cheng created KAFKA-6197:
--

 Summary: Difficult to get to the Kafka Streams javadocs
 Key: KAFKA-6197
 URL: https://issues.apache.org/jira/browse/KAFKA-6197
 Project: Kafka
  Issue Type: Improvement
  Components: documentation
Reporter: James Cheng


In order to get to the javadocs for the Kafka producer/consumer/streams, I 
typically go to http://kafka.apache.org/documentation/ and click on either 2.1, 
2.2, or 2.3 in the table of contents to go right to the appropriate section.

The link for "Streams API" now goes to the (very nice) 
http://kafka.apache.org/10/documentation/streams/. That page doesn't have a 
direct link to the Javadocs anywhere. The examples and guides actually 
frequently mention "See javadocs for details" but there are no direct links to 
it.

If I instead go back to the main page and scroll directly to section 2.3, there 
is still the link to get to the javadocs. But it's harder to jump immediately 
to it. And it's a little confusing that section 2.3 in the table of contents 
does not link you to section 2.3 of the page.

It would be nice if the link to the Streams javadocs was easier to get to.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6175) AbstractIndex should cache index file to avoid unnecessary disk access during resize()

2017-11-06 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16241010#comment-16241010
 ] 

James Cheng commented on KAFKA-6175:


[~lindong] Wow. Great work. Looking forward to this.


> AbstractIndex should cache index file to avoid unnecessary disk access during 
> resize()
> --
>
> Key: KAFKA-6175
> URL: https://issues.apache.org/jira/browse/KAFKA-6175
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Dong Lin
>Assignee: Dong Lin
> Fix For: 1.0.1
>
>
> Currently when we shutdown a broker, we will call AbstractIndex.resize() for 
> all segments on the broker, regardless of whether the log segment is active 
> or not. AbstractIndex.resize() incurs raf.setLength(), which is expensive 
> because it accesses disks. If we do a threaddump during either 
> LogManager.shutdown() or LogManager.loadLogs(), most threads are in RUNNABLE 
> state at java.io.RandomAccessFile.setLength().
> This patch intends to speed up broker startup and shutdown time by skipping 
> AbstractIndex.resize() for inactive log segments.
> Here is the time of LogManager.shutdown() in various settings. In all these 
> tests, broker has roughly 6k partitions and 19k segments.
> - If broker does not have this patch and KAFKA-6172, LogManager.shutdown() 
> takes 69 seconds
> - If broker has KAFKA-6172 but not this patch, LogManager.shutdown() takes 21 
> seconds.
> - If broker has KAFKA-6172 and this patch, LogManager.shutdown() takes 1.6 
> seconds.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6175) AbstractIndex should cache index file to avoid unnecessary disk access during resize()

2017-11-05 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16239899#comment-16239899
 ] 

James Cheng commented on KAFKA-6175:


Do you have any estimates of how much time is saved, similar to your benchmarks 
in https://issues.apache.org/jira/browse/KAFKA-6172?

> AbstractIndex should cache index file to avoid unnecessary disk access during 
> resize()
> --
>
> Key: KAFKA-6175
> URL: https://issues.apache.org/jira/browse/KAFKA-6175
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Dong Lin
>Assignee: Dong Lin
> Fix For: 1.0.1
>
>
> Currently when we shutdown a broker, we will call AbstractIndex.resize() for 
> all segments on the broker, regardless of whether the log segment is active 
> or not. AbstractIndex.resize() incurs raf.setLength(), which is expensive 
> because it accesses disks. If we do a threaddump during either 
> LogManager.shutdown() or LogManager.loadLogs(), most threads are in RUNNABLE 
> state at java.io.RandomAccessFile.setLength().
> This patch intends to speed up broker startup and shutdown time by skipping 
> AbstractIndex.resize() for inactive log segments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KAFKA-6088) Kafka Consumer slows down when reading from highly compacted topics

2017-10-18 Thread James Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Cheng resolved KAFKA-6088.

Resolution: Won't Fix

It is fixed in Kafka client 0.11.0.0, and 0.11.0.0 clients can be used against 
brokers as far back as 0.10.0.0, so anyone affected can upgrade their Kafka 
clients to get the fix. We therefore won't issue a patch fix for older releases.

> Kafka Consumer slows down when reading from highly compacted topics
> ---
>
> Key: KAFKA-6088
> URL: https://issues.apache.org/jira/browse/KAFKA-6088
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 0.10.2.1
>Reporter: James Cheng
> Fix For: 0.11.0.0
>
>
> Summary of the issue
> -
> We found a performance issue with the Kafka Consumer where it gets less 
> efficient if you have frequent gaps in offsets (which happens when there is 
> lots of compaction on the topic).
> The issue is present in 0.10.2.1 and possibly prior.
> It is fixed in 0.11.0.0.
> Summary of cause
> -
> The fetcher code assumes that there will be no gaps in message offsets. If 
> there are, it does an additional round trip to the broker. For topics with 
> large gaps in offsets, it is possible that most calls to {{poll()}} will 
> generate a roundtrip to the broker.
> Background and details 
> -
> We have a topic with roughly 8 million records. The topic is log compacted. 
> It turns out that most of the initial records in the topic were never 
> overwritten, whereas in the 2nd half of the topic we had lots of overwritten 
> records. That means that for the first part of the topic, there are no gaps 
> in offsets. But in the 2nd part of the topic, there are frequent gaps in the 
> offsets (due to records being compacted away).
> We have a consumer that starts up and reads the entire topic from beginning 
> to end. We noticed that the consumer would read through the first part of the 
> topic very quickly. When it got to the part of the topic with frequent gaps 
> in offsets, consumption rate slowed down dramatically. This slowdown was 
> consistent across multiple runs.
> What is happening is this:
> 1) A call to {{poll()}} happens. The consumer goes to the broker and returns 
> 1MB of data (the default of {{max.partition.fetch.bytes}}). It then returns 
> to the caller just 500 records (the default of {{max.poll.records}}), and 
> keeps the rest of the data in memory to use in future calls to {{poll()}}. 
> 2) Before returning the 500 records, the consumer library records the *next* 
> offset it should return. It does so by taking the offset of the last record 
> and adding 1 to it (the offset of the 500th message from the set, plus 1). It 
> calls this the {{nextOffset}}
> 3) The application finishes processing the 500 messages, and makes another 
> call to {{poll()}} happens. During this call, the consumer library does a 
> sanity check. It checks that the first message of the set *it is about to 
> return* has an offset that matches the value of {{nextOffset}}. That is it 
> checks if the 501th record has an offset that is 1 greater than the 500th 
> record.
>   a. If it matches, then it returns an additional 500 records, and 
> increments the {{nextOffset}} to (offset of the 1000th record, plus 1)
>   b. If it doesn't match, then it throws away the remainder of the 1MB of 
> data that it stored in memory in step 1, and it goes back to the broker to 
> fetch an additional 1MB of data, starting at the offset {{nextOffset}}.
> If the topic has no gaps (a non-compacted topic), then the code will always hit 
> the 3a code path.
> If the topic has gaps in offsets and the call to {{poll()}} happens to fall 
> onto a gap, then the code will hit code path 3b.
> If the gaps are frequent, then it will frequently hit code path 3b.
> The worst case scenario that can happen is if you have a large number of 
> gaps, and you run with {{max.poll.records=1}}. Every gap will result in a new 
> fetch to the broker. You may possibly end up only processing one message per 
> fetch. Or, said another way, you will end up doing a single fetch for every 
> single message in the partition.
> Repro
> -
> We created a repro. It appears that the bug is in 0.10.2.1, but was fixed in 
> 0.11. I've attached the tarball with all the code and instructions. 
> The repro is:
> 1) Create a single partition topic with log compaction turned on 
> 2) Write messages with the following keys: 1 1 2 2 3 3 4 4 5 5 ... (each 
> message key written twice in a row) 
> 3) Let compaction happen. This would mean that offsets 0 2 4 6 8 10 ... 
> would be compacted away 
> 4) Consume from this topic with {{max.poll.records=1}}
> More concretely,
> Here is the producer code:
> {code}
> Producer 
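
As a purely illustrative sketch of step 4 above (this is not the attached repro 
code; the topic and group names are placeholders), a consumer pinned to 
{{max.poll.records=1}} against the compacted topic could look like this:

{code}
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GapReproConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "gap-repro");               // placeholder
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "1");               // one record per poll()
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("compacted-topic")); // placeholder
            long consumed = 0;
            while (consumed < 1_000_000) {
                // On 0.10.2.1, every offset gap left by compaction makes the fetcher throw
                // away its buffered data and refetch from the broker; with
                // max.poll.records=1 that can mean one round trip per record.
                ConsumerRecords<String, String> records = consumer.poll(1000);
                consumed += records.count();
            }
        }
    }
}
{code}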

[jira] [Commented] (KAFKA-6054) ERROR "SubscriptionInfo - unable to decode subscription data: version=2" when upgrading from 0.10.0.0 to 0.10.2.1

2017-10-11 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16200571#comment-16200571
 ] 

James Cheng commented on KAFKA-6054:


Here is my conversation with [~mjsax] from the Confluent Slack channel:

{quote}
James Cheng [9:16 AM] 
Does this stack trace mean anything to anyone? It happened when we upgraded a 
kafka streams app from 0.10.0.0 to 0.10.2.1.
^ @mjsax, if you have any time to look. Thanks.


Matthias J Sax 
[9:20 AM] 
That makes sense. We bumped the internal version number when adding IQ feature 
-- thus, it seems you cannot mix instances for both version.


[9:21] 
Seems, we messed up the upgrade path :disappointed:


[9:21] 
If you can, you would need to stop all old instances, before starting with the 
new version.


[9:21] 
Can you also open a JIRA for this?


[9:24] 
Thus, rolling bounces to upgrade should actually work -- is this what you are 
doing?


James Cheng [9:27 AM] 
Yes, we're doing a rolling upgrade. We had (at one point, at least) both 
instances running.


[9:27] 
I imagine that if the 0.10.0.0 versions crashed, then restarted running 
0.10.2.1, then they would be fine because they are all the same version at that 
point, right?


Matthias J Sax 
[9:27 AM] 
Yes.


James Cheng [9:27 AM] 
Cool, thanks.


Matthias J Sax 
[9:28 AM] 
Anyway. Please file a JIRA -- upgrading should always work without this error.


James Cheng [9:29 AM] 
I'll file the JIRA.



Matthias J Sax 
[9:30 AM] 
Thx.
{quote}

> ERROR "SubscriptionInfo - unable to decode subscription data: version=2" when 
> upgrading from 0.10.0.0 to 0.10.2.1
> -
>
> Key: KAFKA-6054
> URL: https://issues.apache.org/jira/browse/KAFKA-6054
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.2.1
>Reporter: James Cheng
>
> We upgraded an app from kafka-streams 0.10.0.0 to 0.10.2.1. We did a rolling 
> upgrade of the app, so at one point there were both 0.10.0.0-based 
> instances and 0.10.2.1-based instances running. 
> We observed the following stack trace:
> {code}
> 2017-10-11 07:02:19.964 [StreamThread-3] ERROR o.a.k.s.p.i.a.SubscriptionInfo 
> -
> unable to decode subscription data: version=2
> org.apache.kafka.streams.errors.TaskAssignmentException: unable to decode
> subscription data: version=2
> at 
> org.apache.kafka.streams.processor.internals.assignment.SubscriptionInfo.decode(SubscriptionInfo.java:113)
> at 
> org.apache.kafka.streams.processor.internals.StreamPartitionAssignor.assign(StreamPartitionAssignor.java:235)
> at 
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.performAssignment(ConsumerCoordinator.java:260)
> at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.onJoinLeader(AbstractCoordinator.java:404)
> at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.access$900(AbstractCoordinator.java:81)
> at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator$JoinGroupResponseHandler.handle(AbstractCoordinator.java:358)
> at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator$JoinGroupResponseHandler.handle(AbstractCoordinator.java:340)
> at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:679)
> at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:658)
> at 
> org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:167)
> at 
> org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:133)
> at 
> org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:107)
> at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.onComplete(ConsumerNetworkClient.java:426)
> at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:278)
> at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:360)
> at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224)
> at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:192)
> at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:163)
> at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:243)
> at 
> 

[jira] [Created] (KAFKA-6054) ERROR "SubscriptionInfo - unable to decode subscription data: version=2" when upgrading from 0.10.0.0 to 0.10.2.1

2017-10-11 Thread James Cheng (JIRA)
James Cheng created KAFKA-6054:
--

 Summary: ERROR "SubscriptionInfo - unable to decode subscription 
data: version=2" when upgrading from 0.10.0.0 to 0.10.2.1
 Key: KAFKA-6054
 URL: https://issues.apache.org/jira/browse/KAFKA-6054
 Project: Kafka
  Issue Type: Bug
  Components: streams
Affects Versions: 0.10.2.1
Reporter: James Cheng


We upgraded an app from kafka-streams 0.10.0.0 to 0.10.2.1. We did a rolling 
upgrade of the app, so at one point there were both 0.10.0.0-based instances 
and 0.10.2.1-based instances running. 

We observed the following stack trace:

{code}
2017-10-11 07:02:19.964 [StreamThread-3] ERROR o.a.k.s.p.i.a.SubscriptionInfo -
unable to decode subscription data: version=2
org.apache.kafka.streams.errors.TaskAssignmentException: unable to decode
subscription data: version=2
at 
org.apache.kafka.streams.processor.internals.assignment.SubscriptionInfo.decode(SubscriptionInfo.java:113)
at 
org.apache.kafka.streams.processor.internals.StreamPartitionAssignor.assign(StreamPartitionAssignor.java:235)
at 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.performAssignment(ConsumerCoordinator.java:260)
at 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator.onJoinLeader(AbstractCoordinator.java:404)
at 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator.access$900(AbstractCoordinator.java:81)
at 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator$JoinGroupResponseHandler.handle(AbstractCoordinator.java:358)
at 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator$JoinGroupResponseHandler.handle(AbstractCoordinator.java:340)
at 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:679)
at 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:658)
at 
org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:167)
at 
org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:133)
at 
org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:107)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.onComplete(ConsumerNetworkClient.java:426)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:278)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:360)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:192)
at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:163)
at 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:243)
at 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.ensurePartitionAssignment(ConsumerCoordinator.java:345)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:977)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:937)
at 
org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:295)
at 
org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:218)

{code}

I spoke with [~mjsax] and he said this is a known issue that happens when you 
have both 0.10.0.0 instances and 0.10.2.1 instances running at the same time, 
because the internal version number of the protocol changed when adding 
Interactive Queries. Matthias asked me to file this JIRA.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-5951) Autogenerate Producer RecordAccumulator metrics

2017-10-04 Thread James Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Cheng updated KAFKA-5951:
---
Fix Version/s: (was: 1.0.0)
   1.1.0

> Autogenerate Producer RecordAccumulator metrics
> ---
>
> Key: KAFKA-5951
> URL: https://issues.apache.org/jira/browse/KAFKA-5951
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: James Cheng
>Assignee: James Cheng
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5951) Autogenerate Producer RecordAccumulator metrics

2017-10-04 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192224#comment-16192224
 ] 

James Cheng commented on KAFKA-5951:


I won't be able to address [~ijuma]'s PR feedback in time for 1.0.0. I will 
move the JIRA to 1.1.0.

> Autogenerate Producer RecordAccumulator metrics
> ---
>
> Key: KAFKA-5951
> URL: https://issues.apache.org/jira/browse/KAFKA-5951
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: James Cheng
>Assignee: James Cheng
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-5952) Refactor Consumer Fetcher metrics

2017-09-21 Thread James Cheng (JIRA)
James Cheng created KAFKA-5952:
--

 Summary: Refactor Consumer Fetcher metrics
 Key: KAFKA-5952
 URL: https://issues.apache.org/jira/browse/KAFKA-5952
 Project: Kafka
  Issue Type: Sub-task
Reporter: James Cheng
Assignee: James Cheng






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5951) Autogenerate Producer RecordAccumulator metrics

2017-09-21 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174346#comment-16174346
 ] 

James Cheng commented on KAFKA-5951:


[~guozhang] [~ijuma], I'd like to get this into 1.0.0, if possible. I think 
it's a low risk change (it's mostly refactoring), and it will make the docs 
better. As a matter of fact, the docs currently show only 4 metrics for the 
RecordAccumulator+BufferPool, and there are in fact 7 in the code.

> Autogenerate Producer RecordAccumulator metrics
> ---
>
> Key: KAFKA-5951
> URL: https://issues.apache.org/jira/browse/KAFKA-5951
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: James Cheng
>Assignee: James Cheng
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-5951) Autogenerate Producer RecordAccumulator metrics

2017-09-21 Thread James Cheng (JIRA)
James Cheng created KAFKA-5951:
--

 Summary: Autogenerate Producer RecordAccumulator metrics
 Key: KAFKA-5951
 URL: https://issues.apache.org/jira/browse/KAFKA-5951
 Project: Kafka
  Issue Type: Sub-task
Reporter: James Cheng






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5738) Add cumulative count attribute for all Kafka rate metrics

2017-09-17 Thread James Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16169222#comment-16169222
 ] 

James Cheng commented on KAFKA-5738:


The PR for this forgot to add the new Sender metrics to the getAllTemplates() 
method, which means we won't have autogenerated docs for those metrics. 
https://github.com/apache/kafka/blame/8a5e86660593eab49c64fdfb5ef090634ae5ae06/clients/src/main/java/org/apache/kafka/clients/producer/internals/SenderMetricsRegistry.java#L111

It's an easy mistake to make, since there is unfortunately no automatic way to 
keep that list in sync with the actual metrics.

Should I submit a PR to add those in? Or [~rsivaram], would you like to do it?

> Add cumulative count attribute for all Kafka rate metrics
> -
>
> Key: KAFKA-5738
> URL: https://issues.apache.org/jira/browse/KAFKA-5738
> Project: Kafka
>  Issue Type: New Feature
>  Components: metrics
>Reporter: Rajini Sivaram
>Assignee: Rajini Sivaram
> Fix For: 1.0.0
>
>
> Add cumulative count attribute to all Kafka rate metrics to make downstream 
> processing simpler, more accurate, and more flexible.
>  
> See 
> [KIP-187|https://cwiki.apache.org/confluence/display/KAFKA/KIP-187+-+Add+cumulative+count+metric+for+all+Kafka+rate+metrics]
>  for details.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

