[jira] [Comment Edited] (KAFKA-8604) kafka log dir was marked as offline because of deleting segments of __consumer_offsets failed
[ https://issues.apache.org/jira/browse/KAFKA-8604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884547#comment-16884547 ] songyingshuan edited comment on KAFKA-8604 at 7/14/19 5:56 AM:
---
I think we have found the reason: in our Kafka cluster there is a topic with high throughput, and the consumer commits its offset every time it is about to poll() more messages. This consumer group committed about 7.5 million records in about 5 minutes, which caused the __consumer_offsets-38 partition to roll a new segment file about every 20 seconds. Whenever a new segment was created, the log cleaner thread tried to clean this partition (we configure 'log.cleaner.threads=11', and only a few topics use the 'compact' cleanup policy). At the end of the cleaning process, an asyncDeleteSegment task is scheduled (by default 60s later); if two consecutive tasks have the same file to delete, the latter will fail. Based on this analysis, we first modified the consumer's code: auto commit was enabled and the interval was fixed to 3 seconds, and the LogDirFailure has not appeared since. So we think asyncDeleteSegment should become a synchronous delete, or the default interval of the async delete operation should be decreased.

> kafka log dir was marked as offline because of deleting segments of __consumer_offsets failed
> -
>
> Key: KAFKA-8604
> URL: https://issues.apache.org/jira/browse/KAFKA-8604
> Project: Kafka
> Issue Type: Bug
> Components: log cleaner
> Affects Versions: 1.0.1
> Reporter: songyingshuan
> Priority: Major
> Attachments: error-logs.log
>
> We encountered a problem in our production environment without any warning. When the Kafka broker tries to clean __consumer_offsets-38 (and it only happens to this partition), the log shows the deletion failed, and the whole disk/log dir is marked offline. This has a negative impact on some healthy partitions, because the ISR lists of those partitions shrink.
> We had to restart the broker server to bring the disk/dir that was marked offline back into use. But this problem recurs periodically for the same reason, so we have to restart the broker periodically.
> We read some source code of kafka-1.0.1 but cannot determine why this happens. The cluster status had been good until this problem suddenly hit us.
> The error log looks like this:
>
> {code:java}
> 2019-06-25 00:11:26,241 INFO kafka.log.TimeIndex: Deleting index /data6/kafka/data/__consumer_offsets-38/012855596978.timeindex.deleted
> 2019-06-25 00:11:26,258 ERROR kafka.server.LogDirFailureChannel: Error while deleting segments for __consumer_offsets-38 in dir /data6/kafka/data
> java.io.IOException: Delete of log .log.deleted failed.
> at kafka.log.LogSegment.delete(LogSegment.scala:496)
> at kafka.log.Log$$anonfun$kafka$log$Log$$deleteSeg$1$1.apply$mcV$sp(Log.scala:1596)
> at kafka.log.Log$$anonfun$kafka$log$Log$$deleteSeg$1$1.apply(Log.scala:1596)
> at kafka.log.Log$$anonfun$kafka$log$Log$$deleteSeg$1$1.apply(Log.scala:1596)
> at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
> at kafka.log.Log.kafka$log$Log$$deleteSeg$1(Log.scala:1595)
> at kafka.log.Log$$anonfun$kafka$log$Log$$asyncDeleteSegment$1.apply$mcV$sp(Log.scala:1599)
> at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:61)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2019-06-25 00:11:26,265 ERROR kafka.utils.KafkaScheduler: Uncaught exception in scheduled task 'delete-file'
> org.apache.kafka.common.errors.KafkaStorageException: Error while deleting segments for __consumer_offsets-38 in dir /data6/kafka/data
> Caused by: java.io.IOException: Delete of log .log.deleted failed.
> at kafka.log.LogSegment.delete(LogSegment.scala:496)
> at kafka.log.Log$$anonfun$kafka$log$Log$$deleteSeg$1$1.apply$mcV$sp(Log.scala:1596)
> at kafka.log.Log$$anonfun$kafka$log$Log$$deleteSeg$1$1.apply(Log.scala:1596)
> at kafka.log.Log$$anonfun$kafka$log$Log$$deleteSeg$1$1.apply(Log.scala:1596)
> at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
> at kafka.log.Log.kafka$log$Log$$deleteSeg$1(Log.scala:1595)
> at kafka.log.Log$$anonfun$kafka$log$Log$$asyncDeleteSegment$1.apply$mcV$sp(Log.scala:1599)
> at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:110)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:61)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> {code}
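The failure mode described in the comment above can be illustrated with a minimal, hypothetical Java sketch. This is not Kafka's actual code; it only models the shape of the race: two cleaner passes each schedule a delete of the same renamed ".deleted" segment file, and the second delete finds the file already gone, which in the reporter's reading surfaced as an IOException and a log-dir failure in Kafka 1.0.1.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical sketch, not Kafka's implementation.
public class DoubleDelete {
    // Mirrors the failing step: File.delete() returns false when the file
    // no longer exists, and that is turned into an IOException here.
    static void deleteSegment(File f) throws IOException {
        if (!f.delete()) {
            throw new IOException("Delete of log " + f.getName() + " failed.");
        }
    }

    public static void main(String[] args) throws Exception {
        File seg = File.createTempFile("segment", ".log.deleted");
        deleteSegment(seg);            // first scheduled delete task: succeeds
        try {
            deleteSegment(seg);        // second task for the same file: fails
        } catch (IOException e) {
            System.out.println("second delete failed: " + e.getMessage());
        }
    }
}
```

A synchronous delete, or deduplicating scheduled delete tasks per file, would avoid the second task ever running against a missing file.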
[jira] [Commented] (KAFKA-8604) kafka log dir was marked as offline because of deleting segments of __consumer_offsets failed
[ https://issues.apache.org/jira/browse/KAFKA-8604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884548#comment-16884548 ] songyingshuan commented on KAFKA-8604:
--
It is worth mentioning that we have another Kafka cluster used specifically to run Kafka Streams/KSQL tasks, and the same problem appeared in that cluster. Looking back now, it is most likely because the streaming/KSQL tasks consume a topic with high throughput and commit offsets at high frequency. What do you think? [~junrao] [~huxi_2b]
[jira] [Commented] (KAFKA-8663) partition assignment would be better original_assignment + new_reassignment during reassignments
[ https://issues.apache.org/jira/browse/KAFKA-8663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884526#comment-16884526 ] GEORGE LI commented on KAFKA-8663:
--
As we can see from the original comment in the code:
{code}
//1. Update AR in ZK with OAR + RAR.
{code}
But the actual implementation does RAR + OAR instead (a different ordering).

> partition assignment would be better original_assignment + new_reassignment during reassignments
> -
>
> Key: KAFKA-8663
> URL: https://issues.apache.org/jira/browse/KAFKA-8663
> Project: Kafka
> Issue Type: Improvement
> Components: controller, core
> Affects Versions: 1.1.1, 2.3.0
> Reporter: GEORGE LI
> Priority: Minor
>
> From my observation/experience during reassignment, the partition assignment replica ordering gets changed, because it is the OAR + RAR (original replicas + reassignment replicas) set union.
> However, the preferred leader can change during the reassignment. Normally, if there is no preferred leader election, the leader is still the old leader; but if a leader election happens during the reassignment, the leadership changes. This causes some side effects. Let's look at this example:
> {code}
> Topic:georgeli_test  PartitionCount:8  ReplicationFactor:3  Configs:
>     Topic: georgeli_test  Partition: 0  Leader: 1026  Replicas: 1026,1028,1025  Isr: 1026,1028,1025
> {code}
> reassignment (1026,1028,1025) => (1027,1025,1028)
> {code}
> Topic:georgeli_test  PartitionCount:8  ReplicationFactor:4  Configs:leader.replication.throttled.replicas=0:1026,0:1028,0:1025,follower.replication.throttled.replicas=0:1027
>     Topic: georgeli_test  Partition: 0  Leader: 1026  Replicas: 1027,1025,1028,1026  Isr: 1026,1028,1025
> {code}
> Notice the above: the leader remains 1026, but Replicas is 1027,1025,1028,1026. If we run a preferred leader election, it will try 1027 first, then 1025. After 1027 is in the ISR, the final assignment will be (1027,1025,1028).
> My proposal for a minor improvement is to keep the original replica ordering during the reassignment (which can take a long time for big topics/partitions), and only after all replicas are in the ISR, set the partition assignment to the new reassignment.
> {code}
> val newAndOldReplicas = (reassignedPartitionContext.newReplicas ++ controllerContext.partitionReplicaAssignment(topicPartition)).toSet
> //1. Update AR in ZK with OAR + RAR.
> updateAssignedReplicasForPartition(topicPartition, newAndOldReplicas.toSeq)
> {code}
> The code above would change to the following, to keep the original ordering first during reassignment:
> {code}
> val newAndOldReplicas = (controllerContext.partitionReplicaAssignment(topicPartition) ++ reassignedPartitionContext.newReplicas).toSet
> {code}
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
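The ordering difference the comment describes can be sketched with a small, hypothetical Java example (the controller code is Scala; an order-preserving union is used here to make iteration order explicit). With the broker IDs from the example above, putting RAR first makes 1027 the head of the replica list (the preferred leader) during reassignment, while putting OAR first keeps 1026 there:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical sketch of the OAR/RAR union ordering, not the controller code.
public class ReplicaUnion {
    // Order-preserving set union: each broker ID keeps the position of its
    // first occurrence, so whichever list comes first decides the preferred
    // leader (the head of the merged replica list).
    static List<Integer> union(List<Integer> first, List<Integer> second) {
        LinkedHashSet<Integer> merged = new LinkedHashSet<>(first);
        merged.addAll(second);
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<Integer> oar = List.of(1026, 1028, 1025); // original replicas
        List<Integer> rar = List.of(1027, 1025, 1028); // reassignment replicas

        // Current behavior (RAR ++ OAR): 1027 becomes the preferred leader.
        System.out.println(union(rar, oar)); // [1027, 1025, 1028, 1026]

        // Proposed behavior (OAR ++ RAR): old leader 1026 stays preferred.
        System.out.println(union(oar, rar)); // [1026, 1028, 1025, 1027]
    }
}
```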
[jira] [Updated] (KAFKA-8663) partition assignment would be better original_assignment + new_reassignment during reassignments
[ https://issues.apache.org/jira/browse/KAFKA-8663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] GEORGE LI updated KAFKA-8663: - Description: >From my observation/experience during reassignment, the partition assignment >replica ordering gets changed. because it's OAR + RAR (original replicas + >reassignment replicas) set union. However, it seems like the preferred leaders changed during the reassignments. Normally if there is no cluster preferred leader election, the leader is still the old leader. But if during the reassignments, there is a leader election, the leadership changes. This caused some side effects. Let's look at this example. {code} Topic:georgeli_test PartitionCount:8ReplicationFactor:3 Configs: Topic: georgeli_testPartition: 0Leader: 1026Replicas: 1026,1028,1025Isr: 1026,1028,1025 {code} reassignment (1026,1028,1025) => (1027,1025,1028) {code} Topic:georgeli_test PartitionCount:8ReplicationFactor:4 Configs:leader.replication.throttled.replicas=0:1026,0:1028,0:1025,follower.replication.throttled.replicas=0:1027 Topic: georgeli_testPartition: 0Leader: 1026Replicas: 1027,1025,1028,1026 Isr: 1026,1028,1025 {code} Notice the above: Leader remains 1026. but Replicas: 1027,1025,1028,1026. If we run preferred leader election, it will try 1027 first, then 1025. After 1027 is in ISR, then the final assignment will be (1027,1025,1028). My proposal for a minor improvement is to keep the original ordering replicas during the reassignment (could be long for big topic/partitions). and after all replicas in ISR, then finally set the partition assignment to New reassignment. {code} val newAndOldReplicas = (reassignedPartitionContext.newReplicas ++ controllerContext.partitionReplicaAssignment(topicPartition)).toSet //1. Update AR in ZK with OAR + RAR. 
updateAssignedReplicasForPartition(topicPartition, newAndOldReplicas.toSeq) {code} above code changed to below to keep the original ordering first during reassignment: {code} val newAndOldReplicas = (controllerContext.partitionReplicaAssignment(topicPartition) ++ reassignedPartitionContext.newReplicas).toSet {code} was: >From my observation/experience during reassignment, the partition assignment >replica ordering gets changed. because it's OAR + RAR (original replicas + >reassignment replicas) set union. However, it seems like the preferred leaders changed during the reassignments. Normally if there is no cluster preferred leader election, the leader is still the old leader. But if during the reassignments, there is a leader election, the leadership changes. This caused some side effects. Let's look at this example. {code} Topic:georgeli_test PartitionCount:8ReplicationFactor:3 Configs: Topic: georgeli_testPartition: 0Leader: 1026Replicas: 1026,1028,1025Isr: 1026,1028,1025 {code} reassignment (1026,1028,1025) => (1027,1025,1028) {code} Topic:georgeli_test PartitionCount:8ReplicationFactor:4 Configs:leader.replication.throttled.replicas=0:1026,0:1028,0:1025,follower.replication.throttled.replicas=0:1027 Topic: georgeli_testPartition: 0Leader: 1026Replicas: 1027,1025,1028,1026 Isr: 1026,1028,1025 {code} Notice the above: Leader remains 1026. but Replicas: 1027,1025,1028,1026. If we run preferred leader election, it will try 1027 first, then 1025. After 1027 is in ISR, then the final assignment will be (1027,1025,1028). My proposal for a minor improvement is to keep the original ordering replicas during the reassignment (could be long for big topic/partitions). and after all replicas in ISR, then finally set the partition assignment to New reassignment. {code} val newAndOldReplicas = (reassignedPartitionContext.newReplicas ++ controllerContext.partitionReplicaAssignment(topicPartition)).toSet //1. Update AR in ZK with OAR + RAR. 
updateAssignedReplicasForPartition(topicPartition, newAndOldReplicas.toSeq) {code} above code changed to below to keep the original ordering during reassignment: {code} val newAndOldReplicas = (controllerContext.partitionReplicaAssignment(topicPartition) ++ reassignedPartitionContext.newReplicas).toSet { code} > partition assignment would be better original_assignment + new_reassignment > during reassignments > > > Key: KAFKA-8663 > URL: https://issues.apache.org/jira/browse/KAFKA-8663 > Project: Kafka > Issue Type: Improvement > Components: controller, core >Affects Versions: 1.1.1, 2.3.0 >Reporter: GEORGE LI >Priority: Minor > > From my observation/e
[jira] [Created] (KAFKA-8663) partition assignment would be better original_assignment + new_reassignment during reassignments
GEORGE LI created KAFKA-8663: Summary: partition assignment would be better original_assignment + new_reassignment during reassignments Key: KAFKA-8663 URL: https://issues.apache.org/jira/browse/KAFKA-8663 Project: Kafka Issue Type: Improvement Components: controller, core Affects Versions: 2.3.0, 1.1.1 Reporter: GEORGE LI From my observation/experience during reassignment, the partition assignment replica ordering gets changed, because it is the OAR + RAR (original replicas + reassignment replicas) set union. As a result, the preferred leader can change during the reassignment. Normally, if no cluster preferred leader election runs, the leader is still the old leader; but if a leader election happens during the reassignment, the leadership changes. This causes some side effects. Let's look at this example. {code} Topic:georgeli_test PartitionCount:8 ReplicationFactor:3 Configs: Topic: georgeli_test Partition: 0 Leader: 1026 Replicas: 1026,1028,1025 Isr: 1026,1028,1025 {code} reassignment (1026,1028,1025) => (1027,1025,1028) {code} Topic:georgeli_test PartitionCount:8 ReplicationFactor:4 Configs:leader.replication.throttled.replicas=0:1026,0:1028,0:1025,follower.replication.throttled.replicas=0:1027 Topic: georgeli_test Partition: 0 Leader: 1026 Replicas: 1027,1025,1028,1026 Isr: 1026,1028,1025 {code} Notice the above: the leader remains 1026, but Replicas is now 1027,1025,1028,1026. If we run preferred leader election, it will try 1027 first, then 1025. After 1027 is in the ISR, the final assignment will be (1027,1025,1028). My proposal for a minor improvement is to keep the original replica ordering during the reassignment (which could take a long time for big topics/partitions), and only after all replicas are in the ISR, set the partition assignment to the new reassignment. {code} val newAndOldReplicas = (reassignedPartitionContext.newReplicas ++ controllerContext.partitionReplicaAssignment(topicPartition)).toSet //1. Update AR in ZK with OAR + RAR.
updateAssignedReplicasForPartition(topicPartition, newAndOldReplicas.toSeq) {code} The code above would change to the following to keep the original ordering during the reassignment: {code} val newAndOldReplicas = (controllerContext.partitionReplicaAssignment(topicPartition) ++ reassignedPartitionContext.newReplicas).toSet {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
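The ordering difference between the two unions can be sketched in plain Java, using the replica IDs from the example above. A `LinkedHashSet` (insertion-ordered) stands in for the controller's union; the point is only which operand comes first, since the head of the resulting replica list becomes the preferred leader:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class ReassignmentOrdering {
    // Union that keeps the insertion order of the first operand,
    // mimicking how the controller builds the combined replica set.
    static List<Integer> union(List<Integer> first, List<Integer> second) {
        LinkedHashSet<Integer> merged = new LinkedHashSet<>(first);
        merged.addAll(second); // duplicates from `second` are dropped
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<Integer> oar = List.of(1026, 1028, 1025); // original replicas
        List<Integer> rar = List.of(1027, 1025, 1028); // reassignment target

        // Current behaviour (RAR first): 1027 becomes the preferred leader.
        System.out.println(union(rar, oar)); // [1027, 1025, 1028, 1026]

        // Proposed (OAR first): 1026 stays the preferred leader until the
        // reassignment completes.
        System.out.println(union(oar, rar)); // [1026, 1028, 1025, 1027]
    }
}
```

Note this is only an illustration of the ordering effect, not the actual controller code, which is Scala and quoted in the issue above.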
[jira] [Resolved] (KAFKA-8183) Trogdor - ProduceBench should retry on UnknownTopicOrPartitionException during topic creation
[ https://issues.apache.org/jira/browse/KAFKA-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislav Kozlovski resolved KAFKA-8183. Resolution: Fixed > Trogdor - ProduceBench should retry on UnknownTopicOrPartitionException > during topic creation > - > > Key: KAFKA-8183 > URL: https://issues.apache.org/jira/browse/KAFKA-8183 > Project: Kafka > Issue Type: Improvement >Reporter: Stanislav Kozlovski >Assignee: Stanislav Kozlovski >Priority: Minor > > There exists a race condition in the Trogdor produce bench worker code where > `WorkerUtils#createTopics()` [notices the topic > exists|https://github.com/apache/kafka/blob/4824dc994d7fc56b7540b643a78aadb4bdd0f14d/tools/src/main/java/org/apache/kafka/trogdor/common/WorkerUtils.java#L159] > yet when it goes on to verify the topics, the DescribeTopics call throws an > `UnknownTopicOrPartitionException`. > We should add sufficient retries such that this does not fail the task. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (KAFKA-8183) Trogdor - ProduceBench should retry on UnknownTopicOrPartitionException during topic creation
[ https://issues.apache.org/jira/browse/KAFKA-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884491#comment-16884491 ] Stanislav Kozlovski commented on KAFKA-8183: [https://github.com/apache/kafka/pull/6937] fixes this issue. Thanks [~cmccabe]! > Trogdor - ProduceBench should retry on UnknownTopicOrPartitionException > during topic creation > - > > Key: KAFKA-8183 > URL: https://issues.apache.org/jira/browse/KAFKA-8183 > Project: Kafka > Issue Type: Improvement >Reporter: Stanislav Kozlovski >Assignee: Stanislav Kozlovski >Priority: Minor > > There exists a race condition in the Trogdor produce bench worker code where > `WorkerUtils#createTopics()` [notices the topic > exists|https://github.com/apache/kafka/blob/4824dc994d7fc56b7540b643a78aadb4bdd0f14d/tools/src/main/java/org/apache/kafka/trogdor/common/WorkerUtils.java#L159] > yet when it goes on to verify the topics, the DescribeTopics call throws an > `UnknownTopicOrPartitionException`. > We should add sufficient retries such that this does not fail the task. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (KAFKA-8662) Produce fails if a previous produce was to an unauthorized topic
[ https://issues.apache.org/jira/browse/KAFKA-8662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884485#comment-16884485 ] ASF GitHub Bot commented on KAFKA-8662: --- rajinisivaram commented on pull request #7086: KAFKA-8662; Fix producer metadata error handling and consumer manual assignment URL: https://github.com/apache/kafka/pull/7086 The producer adds a topic to its Metadata instance when a send is requested. If the metadata request for the topic fails (e.g. due to an authorization failure), we retain the topic in Metadata and continue to attempt refresh until a hard-coded expiry time of 5 minutes. Due to changes introduced in https://github.com/apache/kafka/commit/460e46c3bb76a361d0706b263c03696005e12566, subsequent sends to any topic, including valid authorized topics, report authorization failures for any topic in the metadata, rather than just the topic to which the send is requested. As a result, the producer remains unusable for 5 minutes if a send is requested on an unauthorized topic. This PR fails the send only if metadata for the topic being sent to has an error (or there is a fatal exception such as an authentication failure). The consumer adds a topic to its Metadata instance on `subscribe()` or `assign()`. Even though `assign()` is not incremental and replaces the existing assignment, new assignments were being added to existing topics in SubscriptionState#groupSubscriptions, which is used for fetching topic metadata. This PR does a replace for manual assignment only. ### Committer Checklist (excluded from commit message) - [ ] Verify design and implementation - [ ] Verify test coverage and CI build status - [ ] Verify documentation (including upgrade notes) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Produce fails if a previous produce was to an unauthorized topic > > > Key: KAFKA-8662 > URL: https://issues.apache.org/jira/browse/KAFKA-8662 > Project: Kafka > Issue Type: Bug > Components: producer >Affects Versions: 2.3.0 >Reporter: Rajini Sivaram >Assignee: Rajini Sivaram >Priority: Blocker > Fix For: 2.4.0, 2.3.1 > > > This is a regression introduced by the commit > [https://github.com/apache/kafka/commit/460e46c3bb76a361d0706b263c03696005e12566]. > When we produce to a topic, we add the topic to the producer's Metadata > instance. If metadata authorization fails for the topic, we fail the send and > propagate the authorization exception to the caller. The topic remains in the > Metadata instance. We expire the topic and remove it from Metadata after a fixed > interval of 5 minutes. This has been the case for a while. > > If a subsequent send is to a different authorized topic, we may still get > metadata authorization failures for the previous unauthorized topic that is > still in Metadata. Prior to that commit in 2.3.0, sends to authorized topics > completed successfully even if there were other unauthorized or invalid > topics in the Metadata. Now, we propagate the exceptions without checking the > topic. This is a regression and not the expected behaviour, since the producer > becomes unusable for 5 minutes unless authorization is granted to the first > topic. > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
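The per-topic error handling that the PR above describes can be illustrated with a toy model. The class and method names below are hypothetical stand-ins, not the actual producer internals:

```java
import java.util.HashMap;
import java.util.Map;

public class TopicMetadataModel {
    // Hypothetical stand-in for the per-topic error state kept in Metadata.
    private final Map<String, String> topicErrors = new HashMap<>();

    void recordError(String topic, String error) {
        topicErrors.put(topic, error);
    }

    // Pre-fix behaviour: any error anywhere in the metadata failed every send.
    boolean sendFailsGlobally() {
        return !topicErrors.isEmpty();
    }

    // Post-fix behaviour: only sends to the errored topic fail.
    boolean sendFails(String topic) {
        return topicErrors.containsKey(topic);
    }
}
```

With an authorization error recorded for a topic "restricted", `sendFailsGlobally()` returns true for every send (the regression), while `sendFails("allowed")` returns false (the fixed behaviour).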
[jira] [Commented] (KAFKA-8613) Set default grace period to 0
[ https://issues.apache.org/jira/browse/KAFKA-8613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884478#comment-16884478 ] Lillian commented on KAFKA-8613: [~cadonna], Do you have time to write up a KIP for this issue? If not, I can try. > Set default grace period to 0 > - > > Key: KAFKA-8613 > URL: https://issues.apache.org/jira/browse/KAFKA-8613 > Project: Kafka > Issue Type: Improvement > Components: streams >Affects Versions: 3.0.0 >Reporter: Bruno Cadonna >Priority: Blocker > > Currently, the grace period is set to retention time if the grace period is > not specified explicitly. The reason for setting the default grace period to > retention time was backward compatibility. Topologies that were implemented > before the introduction of the grace period, added late arriving records to a > window as long as the window existed, i.e., as long as its retention time was > not elapsed. > This unintuitive default grace period has already caused confusion among > users. > For the next major release, we should set the default grace period to > {{Duration.ZERO}}. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (KAFKA-8608) Broker shows WARN on reassignment partitions on new brokers: Replica LEO, follower position & Cache truncation
[ https://issues.apache.org/jira/browse/KAFKA-8608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884471#comment-16884471 ] Lillian commented on KAFKA-8608: It appears that there is not enough context to deduce what triggered the warning. [~xmar], Is it possible for you to provide the log (with redactions)? > Broker shows WARN on reassignment partitions on new brokers: Replica LEO, > follower position & Cache truncation > -- > > Key: KAFKA-8608 > URL: https://issues.apache.org/jira/browse/KAFKA-8608 > Project: Kafka > Issue Type: Improvement > Components: core >Affects Versions: 2.1.1 > Environment: Kafka 2.1.1 >Reporter: Di Campo >Priority: Minor > Labels: broker, reassign, repartition > > I added two brokers (brokerId 4,5) to a 3-node (brokerId 1,2,3) cluster where > there were 32 topics and 64 partitions on each, replication 3. > I am running partition reassignment. > On each run, I can see the following WARN messages, but when the partition > reassignment process finishes, it all seems OK. The ISR is OK (count is 3 in every > partition). > But I get the following message types, one per partition: > > {code:java} > [2019-06-27 12:42:03,946] WARN [LeaderEpochCache visitors-0.0.1-10] New epoch > entry EpochEntry(epoch=24, startOffset=51540) caused truncation of > conflicting entries ListBuffer(EpochEntry(epoch=22, startOffset=51540)). > Cache now contains 5 entries. (kafka.server.epoch.LeaderEpochFileCache) {code} > -> This relates to cache, so I suppose it's pretty safe. > {code:java} > [2019-06-27 12:42:04,250] WARN [ReplicaManager broker=1] Leader 1 failed to > record follower 3's position 47981 since the replica is not recognized to be > one of the assigned replicas 1,2,5 for partition visitors-0.0.1-28. Empty > records will be returned for this partition. > (kafka.server.ReplicaManager){code} > -> This is scary. I'm not sure about the severity of this, but it looks like > it may be missing records? 
> {code:java} > [2019-06-27 12:42:03,709] WARN [ReplicaManager broker=1] While recording the > replica LEO, the partition visitors-0.0.1-58 hasn't been created. > (kafka.server.ReplicaManager){code} > -> Here, these partitions are created. > First of all - am I supposed to be missing data here? I am assuming I don't, > and so far I don't see traces of losing anything. > If so, I'm not sure what these messages are trying to say here. Should they > really be at WARN level? If so - should the message clarify better the > different risks involved? > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (KAFKA-8662) Produce fails if a previous produce was to an unauthorized topic
Rajini Sivaram created KAFKA-8662: - Summary: Produce fails if a previous produce was to an unauthorized topic Key: KAFKA-8662 URL: https://issues.apache.org/jira/browse/KAFKA-8662 Project: Kafka Issue Type: Bug Components: producer Affects Versions: 2.3.0 Reporter: Rajini Sivaram Assignee: Rajini Sivaram Fix For: 2.4.0, 2.3.1 This is a regression introduced by the commit [https://github.com/apache/kafka/commit/460e46c3bb76a361d0706b263c03696005e12566]. When we produce to a topic, we add the topic to the producer's Metadata instance. If metadata authorization fails for the topic, we fail the send and propagate the authorization exception to the caller. The topic remains in the Metadata instance. We expire the topic and remove it from Metadata after a fixed interval of 5 minutes. This has been the case for a while. If a subsequent send is to a different authorized topic, we may still get metadata authorization failures for the previous unauthorized topic that is still in Metadata. Prior to that commit in 2.3.0, sends to authorized topics completed successfully even if there were other unauthorized or invalid topics in the Metadata. Now, we propagate the exceptions without checking the topic. This is a regression and not the expected behaviour, since the producer becomes unusable for 5 minutes unless authorization is granted to the first topic. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (KAFKA-6333) java.awt.headless should not be on commandline
[ https://issues.apache.org/jira/browse/KAFKA-6333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884383#comment-16884383 ] Sujay Hegde commented on KAFKA-6333: Hi, I am a newbie and wanted to work on this issue. How do I go about doing this? Please help me out. Thanks, Sujay > java.awt.headless should not be on commandline > -- > > Key: KAFKA-6333 > URL: https://issues.apache.org/jira/browse/KAFKA-6333 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 1.0.0 >Reporter: Fabrice Bacchella >Priority: Trivial > > The option -Djava.awt.headless=true is defined in KAFKA_JVM_PERFORMANCE_OPTS, > but it should not be present on the command line at all. The option is only > useful for applications that can run in both a headless and a traditional > environment. Kafka is a server, so the property should be set in the main > class. This helps reduce clutter on the command line. > See http://www.oracle.com/technetwork/articles/javase/headless-136834.html > for more details. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
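Moving the flag out of the command line, as the issue suggests, would amount to a one-liner early in the server's entry point. A minimal sketch (not the actual Kafka main class):

```java
public class HeadlessSetup {
    public static void main(String[] args) {
        // Set headless mode programmatically, before any AWT class is loaded,
        // so the -Djava.awt.headless=true command-line flag becomes unnecessary.
        System.setProperty("java.awt.headless", "true");
        System.out.println(System.getProperty("java.awt.headless"));
    }
}
```

The only caveat is ordering: the property must be set before anything touches AWT, which is why it belongs at the very top of the main method.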
[jira] [Commented] (KAFKA-8024) UtilsTest.testFormatBytes fails with german locale
[ https://issues.apache.org/jira/browse/KAFKA-8024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884382#comment-16884382 ] Sujay Hegde commented on KAFKA-8024: Hi, I am a newbie. How do I assign this issue to myself? I deem it to be a trivial issue and a good starting point to contribute to kafka. Thanks > UtilsTest.testFormatBytes fails with german locale > -- > > Key: KAFKA-8024 > URL: https://issues.apache.org/jira/browse/KAFKA-8024 > Project: Kafka > Issue Type: Bug >Reporter: Patrik Kleindl >Priority: Trivial > > The unit test fails when the default locale is not English (in my case, deAT) > assertEquals("1.1 MB", formatBytes((long) (1.1 * 1024 * 1024))); > > org.apache.kafka.common.utils.UtilsTest > testFormatBytes FAILED > org.junit.ComparisonFailure: expected:<1[.]1 MB> but was:<1[,]1 MB> > at org.junit.Assert.assertEquals(Assert.java:115) > at org.junit.Assert.assertEquals(Assert.java:144) > at > org.apache.kafka.common.utils.UtilsTest.testFormatBytes(UtilsTest.java:106) > > The easiest fix in this case should be adding > {code:java} > jvmArgs '-Duser.language=en -Duser.country=US'{code} > to the test configuration > [https://github.com/apache/kafka/blob/b03e8c234a8aeecd10c2c96b683cfb39b24b548a/build.gradle#L270] > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
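The failure reported above is easy to reproduce with any locale-sensitive formatter. A minimal illustration (not the Kafka utility itself; `formatMb` is a hypothetical helper):

```java
import java.util.Locale;

public class LocaleFormatting {
    // %.1f uses the decimal separator of the supplied locale.
    static String formatMb(double mb, Locale locale) {
        return String.format(locale, "%.1f MB", mb);
    }

    public static void main(String[] args) {
        System.out.println(formatMb(1.1, Locale.US));      // 1.1 MB
        System.out.println(formatMb(1.1, Locale.GERMANY)); // 1,1 MB
    }
}
```

This is why pinning the test JVM to `-Duser.language=en -Duser.country=US`, as the issue proposes, makes the assertion deterministic regardless of the developer's machine locale.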
[jira] [Commented] (KAFKA-8360) Docs do not mention RequestQueueSize JMX metric
[ https://issues.apache.org/jira/browse/KAFKA-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884343#comment-16884343 ] ASF GitHub Bot commented on KAFKA-8360: --- ankit-kumar-25 commented on pull request #220: KAFKA-8360: Docs do not mention RequestQueueSize JMX metric URL: https://github.com/apache/kafka-site/pull/220 What? :: Mentioning "Request Queue Size" under the [Monitoring](https://kafka.apache.org/documentation/#monitoring) tab. RequestQueueSize is an important metric for monitoring the number of requests in the queue, as a crowded queue might have trouble processing incoming or outgoing requests. Can you please review this? Thanks!! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Docs do not mention RequestQueueSize JMX metric > --- > > Key: KAFKA-8360 > URL: https://issues.apache.org/jira/browse/KAFKA-8360 > Project: Kafka > Issue Type: Improvement > Components: documentation, metrics, network >Reporter: Charles Francis Larrieu Casias >Assignee: Ankit Kumar >Priority: Major > Labels: documentation > > In the [monitoring documentation|https://kafka.apache.org/documentation/#monitoring], there is > no mention of the `kafka.network:type=RequestChannel,name=RequestQueueSize` > JMX metric. This is an important metric because it can indicate that there > are too many requests in the queue and suggest either increasing > `queued.max.requests` (along with perhaps memory), or increasing > `num.io.threads`. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
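For readers wondering how to check the metric without an external monitoring stack, the JMX lookup pattern can be sketched as below. The MBean name is the one quoted in the issue; the code assumes it is querying the platform MBean server inside a broker JVM (a remote `JMXConnector` would be needed otherwise), and the `Value` attribute name is an assumption based on how gauge-style MBeans are commonly exposed:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class RequestQueueSizeReader {
    public static void main(String[] args) throws Exception {
        // MBean name documented in the issue.
        ObjectName name = new ObjectName(
            "kafka.network:type=RequestChannel,name=RequestQueueSize");

        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        if (server.isRegistered(name)) {
            // "Value" is the typical attribute for a gauge; verify with
            // jconsole against a live broker before relying on it.
            Object value = server.getAttribute(name, "Value");
            System.out.println("RequestQueueSize = " + value);
        } else {
            System.out.println("MBean not registered in this JVM");
        }
    }
}
```

A sustained value close to `queued.max.requests` (default 500) is the symptom the documentation change is meant to help operators spot.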