[jira] [Updated] (KAFKA-9973) __consumer_offsets record is invalid lead to log clean failed and __consumer_offsets grows too big
[ https://issues.apache.org/jira/browse/KAFKA-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leibo updated KAFKA-9973: - Attachment: dump.png

> __consumer_offsets record is invalid, leading to log cleaning failure and __consumer_offsets growing too big
>
> Key: KAFKA-9973
> URL: https://issues.apache.org/jira/browse/KAFKA-9973
> Project: Kafka
> Issue Type: Bug
> Components: log cleaner
> Affects Versions: 2.1.1
> Reporter: leibo
> Priority: Major
> Attachments: dump.png, log-cleaner.log.1, screenshot-1.png
>
> __consumer_offsets-34 grows to 13 GB and cannot be compacted successfully; the error log is as below:
> {code:java}
> // code placeholder
> [2019-12-30 19:21:41,047] INFO [kafka-log-cleaner-thread-6]: Starting (kafka.log.LogCleaner)
> [2019-12-30 19:21:41,079] INFO [kafka-log-cleaner-thread-7]: Starting (kafka.log.LogCleaner)
> [2019-12-30 19:21:42,825] WARN [kafka-log-cleaner-thread-0]: Unexpected exception thrown when cleaning log Log(/home/kafka_2.12-2.1.1/data/kafka-logs/__consumer_offsets-34). Marking its partition (__consumer_offsets-34) as uncleanable (kafka.log.LogCleaner)
> org.apache.kafka.common.record.InvalidRecordException: Found invalid number of record headers -47
> --
> [2019-12-30 20:23:51,340] INFO [kafka-log-cleaner-thread-5]: Starting (kafka.log.LogCleaner)
> [2019-12-30 20:23:51,361] INFO Cleaner 0: Building offset map for log __consumer_offsets-34 for 52 segments in offset range [26842714, 73423730). (kafka.log.LogCleaner)
> [2019-12-30 20:23:51,398] INFO [kafka-log-cleaner-thread-6]: Starting (kafka.log.LogCleaner)
> [2019-12-30 20:23:51,493] INFO [kafka-log-cleaner-thread-7]: Starting (kafka.log.LogCleaner)
> [2019-12-30 20:23:59,596] WARN [kafka-log-cleaner-thread-0]: Unexpected exception thrown when cleaning log Log(/home/kafka_2.12-2.1.1/data/kafka-logs/__consumer_offsets-34). Marking its partition (__consumer_offsets-34) as uncleanable (kafka.log.LogCleaner)
> org.apache.kafka.common.record.InvalidRecordException: Found invalid number of record headers -47
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-9973) __consumer_offsets record is invalid lead to log clean failed and __consumer_offsets grows too big
[ https://issues.apache.org/jira/browse/KAFKA-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leibo updated KAFKA-9973: - Attachment: screenshot-1.png
[jira] [Updated] (KAFKA-9973) __consumer_offsets record is invalid lead to log clean failed and __consumer_offsets grows too big
[ https://issues.apache.org/jira/browse/KAFKA-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leibo updated KAFKA-9973: - Description updated.
[jira] [Updated] (KAFKA-9973) __consumer_offsets record is invalid lead to log clean failed and __consumer_offsets grows too big
[ https://issues.apache.org/jira/browse/KAFKA-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leibo updated KAFKA-9973: - Attachment: log-cleaner.log.1
[jira] [Created] (KAFKA-9973) __consumer_offsets record is invalid lead to log clean failed and __consumer_offsets grows too big
leibo created KAFKA-9973:
Summary: __consumer_offsets record is invalid lead to log clean failed and __consumer_offsets grows too big
Key: KAFKA-9973
URL: https://issues.apache.org/jira/browse/KAFKA-9973
Project: Kafka
Issue Type: Bug
Components: log cleaner
Affects Versions: 2.1.1
Reporter: leibo
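The "Found invalid number of record headers -47" error above comes from Kafka's record parser rejecting a negative header count. The sketch below (an illustrative reimplementation, not Kafka's actual source) shows how the record format's zigzag-varint encoding means a single corrupted byte can decode to exactly -47; the class and method names are hypothetical:

```java
import java.nio.ByteBuffer;

public class VarintHeaderCountDemo {

    // Zigzag-varint decoding in the style used by the Kafka record format
    // for fields such as the per-record header count (illustrative only).
    public static int readVarint(ByteBuffer buf) {
        int value = 0;
        int shift = 0;
        int b;
        do {
            b = buf.get() & 0xFF;          // read one byte
            value |= (b & 0x7F) << shift;  // low 7 bits carry payload
            shift += 7;
        } while ((b & 0x80) != 0);         // high bit set means "more bytes"
        // zigzag decode: maps the unsigned varint back to a signed int
        return (value >>> 1) ^ -(value & 1);
    }

    public static void main(String[] args) {
        // A corrupted byte 0x5D read where the header count should be
        // decodes to -47 -- the exact value reported in the cleaner log.
        int numHeaders = readVarint(ByteBuffer.wrap(new byte[]{0x5D}));
        System.out.println("numHeaders = " + numHeaders); // prints "numHeaders = -47"
        // Kafka's parser rejects this (numHeaders < 0) with
        // InvalidRecordException, and the log cleaner then marks the
        // partition uncleanable, so the segment is never compacted.
    }
}
```

Because the cleaner refuses to touch an uncleanable partition, the corrupted segment keeps the whole partition growing, which matches the 13 GB symptom described in the issue.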
[jira] [Commented] (KAFKA-9813) __consumer_offsets loaded cost very long time
[ https://issues.apache.org/jira/browse/KAFKA-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077968#comment-17077968 ] leibo commented on KAFKA-9813: -- [~cricket007] It is a production environment and I cannot access it now. After I restarted the Kafka cluster, it recovered to normal. There was no exception found in server.log. Similar to KAFKA-9065.

> __consumer_offsets loaded cost very long time
>
> Key: KAFKA-9813
> URL: https://issues.apache.org/jira/browse/KAFKA-9813
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 2.1.1
> Reporter: leibo
> Priority: Major
> Attachments: 2020-04-03_163556.png
>
> __consumer_offsets took a very long time to load.
>
> After I restarted Kafka (3-node HA), one __consumer_offsets partition took 41256909 ms (about 11 hours) to load, and the other __consumer_offsets partitions that load after __consumer_offsets-14 could not be loaded, so many consumers could not commit offsets or consume.
> restart time: 2020-04-02 19:06
> __consumer_offsets-14 loaded time: 2020-04-03 06:32, 41256909 ms
> other info:
> 1. There are 72408 consumer groups in the Kafka cluster, and most of them (70498) are empty.
> 2. There was no exception to be found in server.log.
> 3. The disk I/O, CPU, and memory load are all very low.
[jira] [Commented] (KAFKA-9813) __consumer_offsets loaded cost very long time
[ https://issues.apache.org/jira/browse/KAFKA-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077887#comment-17077887 ] leibo commented on KAFKA-9813: -- [~cricket007]
log.retention.bytes=104857600
log.segment.bytes=104857600
[jira] [Updated] (KAFKA-9813) __consumer_offsets loaded cost very long time
[ https://issues.apache.org/jira/browse/KAFKA-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leibo updated KAFKA-9813: - Description updated.
[jira] [Updated] (KAFKA-9813) __consumer_offsets loaded cost very long time
[ https://issues.apache.org/jira/browse/KAFKA-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leibo updated KAFKA-9813: - Description updated.
[jira] [Updated] (KAFKA-9813) __consumer_offsets loaded cost very long time
[ https://issues.apache.org/jira/browse/KAFKA-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leibo updated KAFKA-9813: - Summary: __consumer_offsets loaded cost very long time (was: __consumer_offsets loaded with a very long time)
[jira] [Updated] (KAFKA-9813) __consumer_offsets loaded with a very long time
[ https://issues.apache.org/jira/browse/KAFKA-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leibo updated KAFKA-9813: - Description updated.
[jira] [Updated] (KAFKA-9813) __consumer_offsets loaded with a very long time
[ https://issues.apache.org/jira/browse/KAFKA-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leibo updated KAFKA-9813: - Attachment: 2020-04-03_163556.png
[jira] [Updated] (KAFKA-9813) __consumer_offsets loaded with a very long time
[ https://issues.apache.org/jira/browse/KAFKA-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leibo updated KAFKA-9813: - Description updated.
[jira] [Created] (KAFKA-9813) __consumer_offsets loaded with a very long time
leibo created KAFKA-9813:
Summary: __consumer_offsets loaded with a very long time
Key: KAFKA-9813
URL: https://issues.apache.org/jira/browse/KAFKA-9813
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 2.1.1
Reporter: leibo
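As a sanity check on the reported numbers, the 41256909 ms load time can be converted to hours; it agrees with the gap between the restart (2020-04-02 19:06) and the loaded time (2020-04-03 06:32), i.e. roughly 11.5 hours:

```java
public class LoadTimeCheck {

    // Convert milliseconds to hours (3,600,000 ms per hour).
    public static double msToHours(long ms) {
        return ms / 3_600_000.0;
    }

    public static void main(String[] args) {
        // 41256909 ms is the load time reported for __consumer_offsets-14.
        System.out.printf("%.2f hours%n", msToHours(41_256_909L)); // prints "11.46 hours"
    }
}
```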
[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020027#comment-17020027 ] leibo edited comment on KAFKA-8532 at 1/22/20 12:56 AM:

[~junrao] I did a further analysis, as below:

zk client heartbeat thread org.apache.zookeeper.ClientCnxn$SendThread:

{code:java}
SendThread(ClientCnxnSocket clientCnxnSocket) {
    super(ClientCnxn.makeThreadName("-SendThread()"));
    ClientCnxn.this.state = States.CONNECTING; // init state
    this.clientCnxnSocket = clientCnxnSocket;
    this.setDaemon(true);
}
{code}

As we can see, the zk heartbeat thread SendThread's initial state is *CONNECTING*, and when the zk session has expired, zookeeper.getState() is *CONNECTING*.

{code:java}
// code placeholder
public enum States {
    CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY, CLOSED, AUTH_FAILED, NOT_CONNECTED;

    public boolean isAlive() {
        // when the zk client state is CONNECTING or NOT_CONNECTED, the zk session
        // is obviously not valid, so we cannot regard ZooKeeper as alive
        return this != CLOSED && this != AUTH_FAILED;
    }

    public boolean isConnected() {
        return this == CONNECTED || this == CONNECTEDREADONLY;
    }
}
{code}

According to the code above, when the zk session has expired, ClientCnxn.SendThread will attempt to reconnect to the zk server and establish a new session; its initial state is CONNECTING. If the zk-session-expired thread enters the reinitialize method while the ZooKeeper state is CONNECTING, it will skip the *(!connectionState.isAlive)* condition and execute callAfterInitializingSession, triggering the controller-event-thread to process RegisterBrokerAndReelect while the zk session is invalid.

> controller-event-thread deadlock with zk-session-expiry-handler0
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 2.1.1
> Reporter: leibo
> Priority: Blocker
> Attachments: controllerDeadLockAnalysis-2020-01-20.png, js.log, js0.log, js1.log, js2.log
>
> We have observed a serious deadlock between the controller-event-thread and the zk-session-expiry-handler thread. When this issue occurs, the only way to recover the Kafka cluster is to restart the Kafka server. The following is the jstack log of the controller-event-thread and zk-session-expiry-handler thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x0005ee3f7000> (a java.util.concurrent.CountDownLatch$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:23
[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020027#comment-17020027 ]

leibo edited comment on KAFKA-8532 at 1/21/20 8:51 AM:
---

[~junrao] I did some further analysis, as below.

The zk client heartbeat thread is org.apache.zookeeper.ClientCnxn$SendThread:
{code:java}
SendThread(ClientCnxnSocket clientCnxnSocket) {
    super(ClientCnxn.makeThreadName("-SendThread()"));
    ClientCnxn.this.state = States.CONNECTING; // initial state
    this.clientCnxnSocket = clientCnxnSocket;
    this.setDaemon(true);
}
{code}
As we can see, the SendThread's initial state is *CONNECTING*:
{code:java}
public enum States {
    CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY, CLOSED, AUTH_FAILED, NOT_CONNECTED;

    public boolean isAlive() {
        // when the client state is CONNECTING or NOT_CONNECTED the session is not
        // valid, so we should not regard ZooKeeper as alive
        return this != CLOSED && this != AUTH_FAILED;
    }

    public boolean isConnected() {
        return this == CONNECTED || this == CONNECTEDREADONLY;
    }
}
{code}
From the code above we can see that when the zk session expires, ClientCnxn.SendThread attempts to reconnect to the zk server and establish a new session, and its initial state is CONNECTING. If the zk-session-expiry thread enters the reinitialize method while the ZooKeeper state is CONNECTING, it skips the *if (!connectionState.isAlive)* branch and executes callAfterInitializingSession, which makes the controller-event-thread process RegisterBrokerAndReelect while the zk session is still invalid.
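To make the gap concrete, here is a minimal, self-contained sketch (a trimmed replica of the logic in the enum above, not ZooKeeper's actual class) showing that a client in CONNECTING passes the isAlive check even though isConnected is false, i.e. no usable session exists yet:

```java
// Trimmed replica of the ZooKeeper client state logic quoted above, for illustration only.
enum States {
    CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY, CLOSED, AUTH_FAILED, NOT_CONNECTED;

    boolean isAlive() {
        // "alive" only excludes the terminal states; it says nothing about the session
        return this != CLOSED && this != AUTH_FAILED;
    }

    boolean isConnected() {
        return this == CONNECTED || this == CONNECTEDREADONLY;
    }
}

public class StatesGapDemo {
    public static void main(String[] args) {
        // SendThread's initial state after the old session is lost
        States s = States.CONNECTING;
        // The guard `if (!connectionState.isAlive)` does NOT fire here...
        System.out.println("isAlive=" + s.isAlive());         // prints isAlive=true
        // ...even though no usable session has been established yet:
        System.out.println("isConnected=" + s.isConnected()); // prints isConnected=false
    }
}
```

So a reinitialize guard based on isAlive treats a session-less CONNECTING client the same as a fully connected one, which is exactly the window described above.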
[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019851#comment-17019851 ]

leibo edited comment on KAFKA-8532 at 1/21/20 8:24 AM:
---

[~junrao] I agree with your opinion that "all pending ZK requests should complete with a SessionExpired error through the response callback".

As shown in the attachment controllerDeadLockAnalysis-2020-01-20.png I uploaded: RegisterBrokerAndReelect needs an established zookeeper session to finish processing on the controller-event-thread, and Expire must be processed by the controller-event-thread to establish a new zookeeper session, but the controller-event-thread is busy handling RegisterBrokerAndReelect.

Based on the current analysis, I think there are two ways to deal with this problem:
1. Use a stricter check to prevent the RegisterBrokerAndReelect event from being executed when the zk session has expired.
2. Find out why handleRequest blocks when the zookeeper session has expired.
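The circular wait described here can be sketched with a single-threaded event executor and a latch (illustrative names only, not Kafka's actual classes): the lone controller event thread is stuck inside a RegisterBrokerAndReelect-style task waiting for a session that can never be established, so the Expire task queued behind it never runs and the expiry handler's waitUntilProcessingStarted-style await stalls forever (bounded here with a timeout so the sketch terminates):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DeadlockSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for controller-event-thread: one thread, FIFO queue.
        ExecutorService controllerEventThread = Executors.newSingleThreadExecutor();
        CountDownLatch expireProcessingStarted = new CountDownLatch(1);
        CountDownLatch sessionEstablished = new CountDownLatch(1); // never counted down

        // Event 1: RegisterBrokerAndReelect blocks forever waiting for a live session.
        controllerEventThread.submit(() -> {
            try { sessionEstablished.await(); } catch (InterruptedException ignored) { }
        });
        // Event 2: Expire would count the latch down, but it is queued behind event 1.
        controllerEventThread.submit(expireProcessingStarted::countDown);

        // Stand-in for zk-session-expiry-handler0: waits for Expire to start processing.
        boolean started = expireProcessingStarted.await(500, TimeUnit.MILLISECONDS);
        System.out.println("Expire event processed: " + started); // prints false
        controllerEventThread.shutdownNow();
    }
}
```

This is why option 1 above (refusing to run RegisterBrokerAndReelect on an expired session) breaks the cycle: the blocking event never enters the queue ahead of Expire.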
[jira] [Commented] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020027#comment-17020027 ]

leibo commented on KAFKA-8532:
--

{code:java}
public enum States {
    CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY, CLOSED, AUTH_FAILED, NOT_CONNECTED;

    public boolean isAlive() {
        // when the client state is CONNECTING or NOT_CONNECTED the session is not
        // valid, so we should not regard ZooKeeper as alive
        return this != CLOSED && this != AUTH_FAILED;
    }

    /**
     * Returns whether we are connected to a server (which
     * could possibly be read-only, if this client is allowed
     * to go to read-only mode)
     */
    public boolean isConnected() {
        return this == CONNECTED || this == CONNECTEDREADONLY;
    }
}
{code}
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>    java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x0005ee3f7000> (a java.util.concurrent.CountDownLatch$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // waiting for controller-event-thread to process the Expire event
> at kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
> at kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
> at kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
> at kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
> at kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
> at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown Source)
> at scala.collection.Iterator.foreach(Iterator.scala:937)
> at scala.collection.Iterator.foreach$(Iterator.scala:937)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
> at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
> at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
> at kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
> at kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown Source)
> at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
> at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown Source)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
> - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 nid=0x310 waiting on condition [0x7fccb55c8000]
>    java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x0005d1be5a00> (a java.util.concurrent.CountD
[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019294#comment-17019294 ]

leibo edited comment on KAFKA-8532 at 1/21/20 1:44 AM:
---

[~junrao], [~yuzhih...@gmail.com] Thank you for your reply.

1. I have tested the exception in handleRequest() and, as you said, exceptions thrown in the handleRequest method will not block on countDownLatch.await().

2. I rechecked the jstack log in the attachment; my analysis result is in the attachment *controllerDeadLockAnalysis-2020-01-20.png*:
{code:java}
private def reinitialize(): Unit = {
  // Initialization callbacks are invoked outside of the lock to avoid deadlock potential since their completion
  // may require additional Zookeeper requests, which will block to acquire the initialization lock
  stateChangeHandlers.values.foreach(callBeforeInitializingSession _)
  inWriteLock(initializationLock) {
    // if (!connectionState.isAlive) {  // this check may be inaccurate
    if (!connectionState.isConnected) { // If we use isConnected to decide whether the zookeeper session needs to be
                                        // reinitialized, the program will not skip this block and then execute
                                        // operations that require zookeeper to be connected.
      zooKeeper.close()
      info(s"Initializing a new session to $connectString.")
      // retry forever until ZooKeeper can be instantiated
      var connected = false
      while (!connected) {
        try {
          zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, ZooKeeperClientWatcher)
          connected = true
        } catch {
          case e: Exception =>
            info("Error when recreating ZooKeeper, retrying after a short sleep", e)
            Thread.sleep(1000)
        }
      }
    }
  }
  stateChangeHandlers.values.foreach(callAfterInitializingSession _)
}
{code}
Zookeeper code:
{code:java}
@InterfaceAudience.Public
public enum States {
    CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY, CLOSED, AUTH_FAILED, NOT_CONNECTED;

    public boolean isAlive() {
        return this != CLOSED && this != AUTH_FAILED;
    }

    /**
     * Returns whether we are connected to a server (which
     * could possibly be read-only, if this client is allowed
     * to go to read-only mode)
     */
    public boolean isConnected() {
        return this == CONNECTED || this == CONNECTEDREADONLY;
    }
}
{code}
I hope you can take a look at this.
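The effect of the proposed guard change can be sketched as a pair of decision functions (illustrative only; the real check sits in kafka.zookeeper.ZooKeeperClient.reinitialize, which is Scala). With `!isAlive`, a CONNECTING client is treated as needing no reinitialization even though its session is gone, so the after-init callbacks fire on a dead session; with `!isConnected`, the session is recreated first:

```java
public class ReinitGuardSketch {
    // Trimmed replica of the ZooKeeper client states, for illustration only.
    enum States {
        CONNECTING, CONNECTED, CONNECTEDREADONLY, CLOSED, AUTH_FAILED;
        boolean isAlive()     { return this != CLOSED && this != AUTH_FAILED; }
        boolean isConnected() { return this == CONNECTED || this == CONNECTEDREADONLY; }
    }

    // Current guard: recreate the session only when the client is not "alive".
    static boolean needsReinitUsingIsAlive(States s)     { return !s.isAlive(); }
    // Proposed guard: recreate the session whenever there is no live connection.
    static boolean needsReinitUsingIsConnected(States s) { return !s.isConnected(); }

    public static void main(String[] args) {
        States afterExpiry = States.CONNECTING; // SendThread state right after expiry
        // Current guard skips recreation while the session is still invalid:
        System.out.println(needsReinitUsingIsAlive(afterExpiry));     // prints false
        // Proposed guard recreates the session before any callbacks run:
        System.out.println(needsReinitUsingIsConnected(afterExpiry)); // prints true
    }
}
```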
[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019294#comment-17019294 ] leibo edited comment on KAFKA-8532 at 1/21/20 1:43 AM:
---
[~junrao], [~yuzhih...@gmail.com] Thank you for your reply.

1. I have tested the exception case in handleRequest(), and as you said, exceptions thrown in the handleRequest method do not block on countDownLatch.await().

2. I rechecked the jstack log in the attachment; my analysis result is attached as *controllerDeadLockAnalysis-2020-01-20.png*.

{code:java}
private def reinitialize(): Unit = {
  // Initialization callbacks are invoked outside of the lock to avoid deadlock potential since their completion
  // may require additional Zookeeper requests, which will block to acquire the initialization lock
  stateChangeHandlers.values.foreach(callBeforeInitializingSession _)
  inWriteLock(initializationLock) {
    // if (!connectionState.isAlive) {  // this check may be inaccurate
    if (!connectionState.isConnected) { // If isConnected is used to decide whether the ZooKeeper session needs to be
                                        // reinitialized, the program will not skip this branch and then go on to
                                        // execute operations that require ZooKeeper to be connected.
      zooKeeper.close()
      info(s"Initializing a new session to $connectString.")
      // retry forever until ZooKeeper can be instantiated
      var connected = false
      while (!connected) {
        try {
          zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, ZooKeeperClientWatcher)
          connected = true
        } catch {
          case e: Exception =>
            info("Error when recreating ZooKeeper, retrying after a short sleep", e)
            Thread.sleep(1000)
        }
      }
    }
  }
  stateChangeHandlers.values.foreach(callAfterInitializingSession _)
}
{code}

Zookeeper code:

{code:java}
@InterfaceAudience.Public
public enum States {
  CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY, CLOSED, AUTH_FAILED, NOT_CONNECTED;

  public boolean isAlive() {
    return this != CLOSED && this != AUTH_FAILED; // this check may be inaccurate
  }

  /**
   * Returns whether we are connected to a server (which
   * could possibly be read-only, if this client is allowed
   * to go to read-only mode)
   */
  public boolean isConnected() {
    return this == CONNECTED || this == CONNECTEDREADONLY;
  }
}
{code}

I hope you can take a look at this.
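The difference between the two checks quoted above can be illustrated with a minimal standalone sketch (not the actual ZooKeeper sources; the enum below only reproduces the quoted logic). A session in the CONNECTING state is still considered alive but not connected, which is why a guard of !connectionState.isAlive can skip reinitialization even though the client has no usable connection:

```java
// Minimal sketch reproducing the States logic quoted above, to show why
// isAlive and isConnected disagree while the client is reconnecting.
enum States {
    CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY, CLOSED, AUTH_FAILED, NOT_CONNECTED;

    // "alive" means the session has not been terminally closed or auth-failed
    boolean isAlive() { return this != CLOSED && this != AUTH_FAILED; }

    // "connected" means there is an actual live connection to a server
    boolean isConnected() { return this == CONNECTED || this == CONNECTEDREADONLY; }
}

public class StateCheckDemo {
    public static void main(String[] args) {
        // While (re)connecting, the session is alive but not connected,
        // so a !isAlive guard sees no reason to recreate the session.
        States s = States.CONNECTING;
        System.out.println(s.isAlive());      // true
        System.out.println(s.isConnected());  // false
    }
}
```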
> controller-event-thread deadlock with zk-session-expiry-handler0
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 2.1.1
> Reporter: leibo
> Priority: Blocker
> Attachments: controllerDeadLockAnalysis-2020-01-20.png, js.log, js0.log, js1.log, js2.log
>
> We have observed a serious deadlock between the controller-event-thread and the zk-session-expiry-handler0 thread. When this issue occurs, the only way to recover the Kafka cluster is to restart the Kafka server.
> The following is the jstack output for the controller-event-thread and the zk-session-expiry-handler0 thread:
>
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x0005ee3f7000> (a java.util.concurrent.CountDownLatch$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // waiting for the controller-event-thread to process the Expire event
> at kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
> at kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
> at kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
> at kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
> at kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
> at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown Source)
> at scala.collection.Iterator.foreach(Iterator.scala:937)
> at scala.collection.Iterator.foreach$(Iterator.scala:937)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
> at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
> at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
> at kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
> at kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown Source)
> at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
> at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown Source)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
> - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
>
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 nid=0x310 waiting on condition [0x7fccb55c8000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x0005d1be5a00> (a java.util.concurrent.CountDownLatch$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
> at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157)
> at kafka.zk.KafkaZkClient.retryReq
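The two stack traces form a circular wait: the session-expiry handler blocks until the controller event thread starts processing the Expire event, while the controller event thread blocks inside handleRequests waiting on a ZooKeeper request that cannot complete until reinitialization finishes. A minimal sketch of that pattern, using hypothetical latch names standing in for Kafka's internals (not Kafka's actual code):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the circular wait seen in the jstack output above.
public class DeadlockSketch {

    // Returns whether the expiry handler ever made progress (it cannot).
    static boolean simulate() {
        // expireProcessed: counted down once the controller-event-thread starts
        // processing the Expire event. requestDone: counted down once a pending
        // ZooKeeper request completes -- which needs the session that
        // reinitialize() has not finished recreating.
        CountDownLatch expireProcessed = new CountDownLatch(1);
        CountDownLatch requestDone = new CountDownLatch(1);

        Thread sessionExpiryHandler = new Thread(() -> {
            try {
                // beforeInitializingSession -> Expire.waitUntilProcessingStarted
                expireProcessed.await();     // blocks forever
                requestDone.countDown();     // never reached
            } catch (InterruptedException ignored) { }
        }, "zk-session-expiry-handler0");

        Thread controllerEventThread = new Thread(() -> {
            try {
                // a pending ZooKeeper request inside handleRequests
                requestDone.await();         // blocks forever
                expireProcessed.countDown(); // never reached
            } catch (InterruptedException ignored) { }
        }, "controller-event-thread");

        sessionExpiryHandler.start();
        controllerEventThread.start();

        // Neither latch can ever reach zero: each thread waits for the other.
        boolean progressed;
        try {
            progressed = expireProcessed.await(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            progressed = false;
        }
        sessionExpiryHandler.interrupt();
        controllerEventThread.interrupt();
        return progressed;
    }

    public static void main(String[] args) {
        System.out.println("progressed=" + simulate()); // progressed=false
    }
}
```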
[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leibo updated KAFKA-8532: - Attachment: controllerDeadLockAnalysis-2020-01-20.png
[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leibo updated KAFKA-8532: - Attachment: (was: controllerDeadLockAnalysis2020-01-20.png) > controller-event-thread deadlock with zk-session-expiry-handler0 > > > Key: KAFKA-8532 > URL: https://issues.apache.org/jira/browse/KAFKA-8532 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.1.1 >Reporter: leibo >Priority: Blocker > Attachments: js.log, js0.log, js1.log, js2.log > > > We have observed a serious deadlock between the controller-event-thread and > the zk-session-expiry-handler thread. When this issue occurs, the only way > to recover the Kafka cluster is to restart the Kafka servers. The following > is the jstack output of the controller-event-thread and > zk-session-expiry-handler threads. > "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 > tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x0005ee3f7000> (a > java.util.concurrent.CountDownLatch$Sync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // > waiting for controller-event-thread to process the Expire event > at > kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533) > at > kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173) > at > kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408) > at > 
kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374) > at > kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374) > at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown > Source) > at scala.collection.Iterator.foreach(Iterator.scala:937) > at scala.collection.Iterator.foreach$(Iterator.scala:937) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1425) > at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209) > at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374) > at > kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428) > at > kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown > Source) > at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114) > at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown > Source) > at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Locked ownable synchronizers: > - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker) > "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 > nid=0x310 waiting on condition [0x7fccb55c8000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x0005d1be5a00> (a > 
java.util.concurrent.CountDownLatch$Sync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) > at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157) > at > kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.sc
[jira] [Commented] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019294#comment-17019294 ] leibo commented on KAFKA-8532: -- [~junrao] Thank you for your reply. 1. I have tested exceptions in handleRequest(), and as you said, exceptions thrown in handleRequest() will not block on countDownLatch.await(). 2. I rechecked the jstack log in the attachment; my analysis result is attached as controllerDeadLockAnalysis2020-01-20.png. !controllerDeadLockAnalysis2020-01-20.png! > controller-event-thread deadlock with zk-session-expiry-handler0 > > > Key: KAFKA-8532 > URL: https://issues.apache.org/jira/browse/KAFKA-8532 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.1.1 >Reporter: leibo >Priority: Blocker > Attachments: controllerDeadLockAnalysis2020-01-20.png, js.log, > js0.log, js1.log, js2.log
[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leibo updated KAFKA-8532: - Attachment: controllerDeadLockAnalysis2020-01-20.png > controller-event-thread deadlock with zk-session-expiry-handler0 > > > Key: KAFKA-8532 > URL: https://issues.apache.org/jira/browse/KAFKA-8532 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.1.1 >Reporter: leibo >Priority: Blocker > Attachments: controllerDeadLockAnalysis2020-01-20.png, js.log, > js0.log, js1.log, js2.log
[jira] [Commented] (KAFKA-6879) Controller deadlock following session expiration
[ https://issues.apache.org/jira/browse/KAFKA-6879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017827#comment-17017827 ] leibo commented on KAFKA-6879: -- Hello [~hachikuji], I have hit this issue on Kafka 2.1.1 many times, and the description is here: https://issues.apache.org/jira/browse/KAFKA-8532 So I think the controller deadlock problem is not completely solved. > Controller deadlock following session expiration > > > Key: KAFKA-6879 > URL: https://issues.apache.org/jira/browse/KAFKA-6879 > Project: Kafka > Issue Type: Bug > Components: controller >Affects Versions: 1.1.0 >Reporter: Jason Gustafson >Assignee: Jason Gustafson >Priority: Critical > Fix For: 1.1.1, 2.0.0 > > > We have observed an apparent deadlock situation which occurs following a > session expiration. The suspected deadlock occurs between the zookeeper > "initializationLock" and the latch inside the Expire event which we use to > ensure all events have been handled. > In the logs, we see the "Session expired" message following acquisition of > the initialization lock: > https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/zookeeper/ZooKeeperClient.scala#L358 > But we never see any logs indicating that the new session is being > initialized. In fact, the controller logs are basically empty from that point > on. The problem we suspect is that completion of the > {{beforeInitializingSession}} callback requires that all events have finished > processing in order to count down the latch: > https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/controller/KafkaController.scala#L1525. > But an event which was dequeued just prior to the acquisition of the write > lock may be unable to complete because it is awaiting acquisition of the > initialization lock: > https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/zookeeper/ZooKeeperClient.scala#L137. > The impact is that the broker continues in a zombie state. 
It continues > fetching and is periodically added to ISRs, but it never receives any further > requests from the controller since it is not registered. -- This message was sent by Atlassian Jira (v8.3.4#803005)
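The cycle described in KAFKA-6879 can be reduced to a small runnable sketch. The names (`initializationLock`, the "event" latch) are borrowed from the description only; this is a hypothetical model of the lock/latch interaction, not Kafka's actual code, and it uses a bounded `await()` so the sketch terminates instead of deadlocking:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ControllerDeadlockSketch {

    // Models the cycle: the expiry handler holds the write lock of
    // "initializationLock" and waits on a latch that only counts down after
    // the in-flight event has taken the READ lock of that same lock.
    // Returns whether the event managed to finish while the write lock was held.
    public static boolean eventFinishesWhileWriteLocked(long timeoutMs) {
        ReentrantReadWriteLock initializationLock = new ReentrantReadWriteLock();
        CountDownLatch eventProcessed = new CountDownLatch(1);

        Thread eventThread = new Thread(() -> {
            // The event was dequeued before expiry; it needs the read lock
            // (to send its ZK request) before it can count down the latch.
            initializationLock.readLock().lock();
            try {
                eventProcessed.countDown();
            } finally {
                initializationLock.readLock().unlock();
            }
        }, "controller-event-thread");

        initializationLock.writeLock().lock(); // expiry-handler path
        boolean finished;
        try {
            eventThread.start();
            // The read lock cannot be granted while we hold the write lock,
            // so this wait always times out.
            finished = eventProcessed.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            finished = false;
        } finally {
            initializationLock.writeLock().unlock(); // lets the sketch exit
        }
        try {
            eventThread.join();
        } catch (InterruptedException ignored) {
        }
        return finished;
    }

    public static void main(String[] args) {
        System.out.println(eventFinishesWhileWriteLocked(300)); // prints false
    }
}
```

In the real broker neither side has a timeout, which is why the process stays in the zombie state described above.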
[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leibo updated KAFKA-8532: - Description: We have observed a serious deadlock between the controller-event-thread and the zk-session-expiry-handler thread. When this issue occurs, the only way to recover the Kafka cluster is to restart the Kafka servers. The following is the jstack output of the controller-event-thread and zk-session-expiry-handler threads. "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0005ee3f7000> (a java.util.concurrent.CountDownLatch$Sync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // waiting for controller-event-thread to process the Expire event at kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533) at kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173) at kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408) at kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374) at kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374) at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown Source) at scala.collection.Iterator.foreach(Iterator.scala:937) at scala.collection.Iterator.foreach$(Iterator.scala:937) at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1425) at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209) at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374) at kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428) at kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown Source) at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114) at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown Source) at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Locked ownable synchronizers: - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker) "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 nid=0x310 waiting on condition [0x7fccb55c8000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0005d1be5a00> (a java.util.concurrent.CountDownLatch$Sync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157) at kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1596) at kafka.zk.KafkaZkClient.kafka$zk$KafkaZkClient$$retryRequestUntilConnected(KafkaZkClient.scala:1589) at kafka.zk.KafkaZkClient.deletePreferredReplicaElection(KafkaZkClient.scala:989) at kafka.controller.KafkaController.removePartitionsFromPreferredReplicaElection(KafkaController.scala:873) at kafka.controller.KafkaController.kafka$controller$KafkaController$$onPreferredReplicaElection(KafkaController.scala:631) at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:266) at kafka.controller.KafkaController.kafka$controller$KafkaController$$elect(KafkaController.scala:1221) at kafka.controller.KafkaController$Reelect$.pr
[jira] [Updated] (KAFKA-9366) please consider upgrade log4j to log4j2 due to critical security problem CVE-2019-17571
[ https://issues.apache.org/jira/browse/KAFKA-9366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leibo updated KAFKA-9366: - Description: h2. CVE-2019-17571 Detail Included in Log4j 1.2 is a SocketServer class that is vulnerable to deserialization of untrusted data which can be exploited to remotely execute arbitrary code when combined with a deserialization gadget when listening to untrusted network traffic for log data. This affects Log4j versions up to 1.2 up to 1.2.17. [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-17571] was: h2. CVE-2019-17571 Detail Included in Log4j 1.2 is a SocketServer class that is vulnerable to deserialization of untrusted data which can be exploited to remotely execute arbitrary code when combined with a deserialization gadget when listening to untrusted network traffic for log data. This affects Log4j versions up to 1.2 up to 1.2.17. > please consider upgrade log4j to log4j2 due to critical security problem > CVE-2019-17571 > --- > > Key: KAFKA-9366 > URL: https://issues.apache.org/jira/browse/KAFKA-9366 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.2.0, 2.1.1, 2.3.0, 2.4.0 >Reporter: leibo >Priority: Critical
[jira] [Created] (KAFKA-9366) please consider upgrade log4j to log4j2 due to critical security problem CVE-2019-17571
leibo created KAFKA-9366: Summary: please consider upgrade log4j to log4j2 due to critical security problem CVE-2019-17571 Key: KAFKA-9366 URL: https://issues.apache.org/jira/browse/KAFKA-9366 Project: Kafka Issue Type: Bug Components: core Affects Versions: 2.4.0, 2.3.0, 2.1.1, 2.2.0 Reporter: leibo h2. CVE-2019-17571 Detail Included in Log4j 1.2 is a SocketServer class that is vulnerable to deserialization of untrusted data which can be exploited to remotely execute arbitrary code when combined with a deserialization gadget when listening to untrusted network traffic for log data. This affects Log4j versions up to 1.2 up to 1.2.17.
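For reference, a migration to Log4j 2 would replace the log4j 1.2 jar with log4j-api/log4j-core and move to the Log4j 2 configuration format. The sketch below is an assumption modeled loosely on Kafka's default console layout, not an official migration config:

```properties
# Minimal Log4j 2 configuration in properties format (hypothetical sketch)
status = warn

appender.console.type = Console
appender.console.name = STDOUT
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = [%d] %p %m (%c)%n

rootLogger.level = INFO
rootLogger.appenderRef.stdout.ref = STDOUT
```

Note that Log4j 2 does not expose the vulnerable log4j 1.2 `SocketServer`; the CVE applies only to deployments that run that class to receive serialized log events over the network.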
[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leibo updated KAFKA-8532: - Affects Version/s: (was: 2.3.0) > controller-event-thread deadlock with zk-session-expiry-handler0 > > > Key: KAFKA-8532 > URL: https://issues.apache.org/jira/browse/KAFKA-8532 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.1.1 >Reporter: leibo >Priority: Blocker > Attachments: js.log, js0.log, js1.log, js2.log
[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leibo updated KAFKA-8532: - Affects Version/s: 2.3.0 > controller-event-thread deadlock with zk-session-expiry-handler0 > > > Key: KAFKA-8532 > URL: https://issues.apache.org/jira/browse/KAFKA-8532 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.1.1, 2.3.0 >Reporter: leibo >Priority: Blocker > Attachments: js.log, js0.log, js1.log, js2.log > > > We have observed a serious deadlock between controller-event-thead and > zk-session-expirey-handle thread. When this issue occurred, it's only one way > to recovery the kafka cluster is restart kafka server. The follows is the > jstack log of controller-event-thead and zk-session-expiry-handle thread. > "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 > tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x0005ee3f7000> (a > java.util.concurrent.CountDownLatch$Sync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // > 等待controller-event-thread线程处理expireEvent > at > kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533) > at > kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173) > at > kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408) > at > 
kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374) > at > kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374) > at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown > Source) > at scala.collection.Iterator.foreach(Iterator.scala:937) > at scala.collection.Iterator.foreach$(Iterator.scala:937) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1425) > at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209) > at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374) > at > kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428) > at > kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown > Source) > at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114) > at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown > Source) > at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Locked ownable synchronizers: > - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker) > "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 > nid=0x310 waiting on condition [0x7fccb55c8000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x0005d1be5a00> (a > 
java.util.concurrent.CountDownLatch$Sync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) > at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157) > at > kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1596) > at > kafka.zk.Kafk
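The mutual wait shown in the two stack traces above can be modeled in a few lines. The following is an illustrative Java sketch only, not Kafka code; the queue, latches, and method names are all invented. A "controller" thread blocks inside event handling waiting for a ZooKeeper response that never arrives, while the "expiry" thread waits for its queued Expire event to start processing, which requires the very thread that is blocked:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class DeadlockModel {
    // Returns whether the Expire event started processing within the timeout.
    static boolean simulate() throws InterruptedException {
        BlockingQueue<Runnable> controllerQueue = new LinkedBlockingQueue<>();
        CountDownLatch zkResponse = new CountDownLatch(1);    // never counted down: ZK is unreachable
        CountDownLatch expireStarted = new CountDownLatch(1); // counted down only by the controller thread

        // controller-event-thread: handles ISRChangeNotification first and blocks on the ZK response.
        Thread controller = new Thread(() -> {
            try {
                controllerQueue.take().run(); // ISRChangeNotification: blocks on zkResponse
                controllerQueue.take().run(); // Expire: never reached while blocked above
            } catch (InterruptedException ignored) {
                // interrupted during cleanup at the end of the simulation
            }
        });
        controllerQueue.add(() -> {
            try {
                zkResponse.await();
            } catch (InterruptedException ignored) {
            }
        });
        controller.start();

        // zk-session-expiry-handler0: enqueues Expire, then waits for it to start processing.
        controllerQueue.add(expireStarted::countDown);
        boolean started = expireStarted.await(500, TimeUnit.MILLISECONDS);
        controller.interrupt(); // break the cycle so the demo terminates
        return started;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(simulate()); // prints "false": neither thread makes progress
    }
}
```

With the untimed awaits used in the real code, neither thread ever returns, which matches the WAITING (parking) state of both threads in the jstack output.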
[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932042#comment-16932042 ] leibo edited comment on KAFKA-8532 at 9/18/19 4:27 AM: --- [~junrao], [~ijuma] My analysis: When Kafka is disconnected from ZooKeeper, the controller-event-thread is handling an ISRChangeNotification event. Because Kafka is disconnected from ZK, controller event handling is blocked in retryRequestsUntilConnected -> handleRequests. At this time, the zk-session-expiry-handler thread tries to reinitialize a new ZK connection: it puts a controller event named Expire onto the ControllerEventManager's LinkedBlockingQueue. Since the controller-event-thread is still blocked on the ISRChangeNotification event, the Expire event will never be handled by the ControllerEventThread. The problem is obvious: ISRChangeNotification and Expire block each other. The problematic code is here:

{code:java}
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] = {
  if (requests.isEmpty)
    Seq.empty
  else {
    val countDownLatch = new CountDownLatch(requests.size)
    val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

    requests.foreach { request =>
      inFlightRequests.acquire()
      try {
        inReadLock(initializationLock) {
          send(request) { response =>
            responseQueue.add(response)
            inFlightRequests.release()
            countDownLatch.countDown()
          }
        }
      } catch {
        case e: Throwable =>
          inFlightRequests.release()
          throw e
      }
    }
    countDownLatch.await()
    responseQueue.asScala.toBuffer
  }
}
{code}

As we can see, if an exception occurs, countDownLatch.countDown() will never be executed, and this method will block here forever.
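The missing countDown() on the exception path can be demonstrated in isolation. This is a minimal Java sketch of the same latch behavior, not Kafka code; it assumes a latch sized for two responses of which only one callback ever fires (as when send() throws after the first request):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchTimeoutDemo {
    // Models handleRequests() waiting on a latch sized for two responses
    // when only one response callback ever fires (e.g. the second send() threw).
    static boolean allResponsesArrived() throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(2);
        latch.countDown(); // first response arrives; the second never does
        // The real code calls the untimed await(), which blocks forever here.
        // A timed await surfaces the stall instead of hanging the thread.
        return latch.await(200, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(allResponsesArrived()); // prints "false"
    }
}
```

This is why the controller-event-thread shown above parks permanently on CountDownLatch.await in handleRequests.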
[jira] [Commented] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932006#comment-16932006 ] leibo commented on KAFKA-8532: -- Hi [~junrao], the problem has happened again, and the jstack dump logs from the cluster have been uploaded. Please help analyze this problem, thanks. > controller-event-thread deadlock with zk-session-expiry-handler0 > > > Key: KAFKA-8532 > URL: https://issues.apache.org/jira/browse/KAFKA-8532 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.1.1 >Reporter: leibo >Priority: Blocker > Attachments: js.log, js0.log, js1.log, js2.log 
[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

leibo updated KAFKA-8532:
-------------------------
    Attachment: js2.log

> controller-event-thread deadlock with zk-session-expiry-handler0
> ----------------------------------------------------------------
>
>                 Key: KAFKA-8532
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8532
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.1.1
>            Reporter: leibo
>            Priority: Blocker
>         Attachments: js.log, js0.log, js1.log, js2.log
>
> We have observed a serious deadlock between the controller-event-thread and
> the zk-session-expiry-handler thread. When this issue occurs, the only way to
> recover the Kafka cluster is to restart the Kafka server. Below is the jstack
> output of the controller-event-thread and zk-session-expiry-handler threads.
>
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x0005ee3f7000> (a java.util.concurrent.CountDownLatch$Sync)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>         at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // waiting for the controller-event-thread to process the Expire event
>         at kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>         at kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>         at kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>         at kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>         at kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>         at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown Source)
>         at scala.collection.Iterator.foreach(Iterator.scala:937)
>         at scala.collection.Iterator.foreach$(Iterator.scala:937)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>         at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>         at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>         at kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>         at kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown Source)
>         at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>         at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown Source)
>         at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>
>    Locked ownable synchronizers:
>         - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
>
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 nid=0x310 waiting on condition [0x7fccb55c8000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x0005d1be5a00> (a java.util.concurrent.CountDownLatch$Sync)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>         at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>         at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157)
>         at kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1596)
>         at kafka.zk.KafkaZkClient.ka
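The two stacks above show each thread parked in `CountDownLatch.await`, each waiting for progress that only the other thread can make. Below is a minimal, hypothetical Java sketch of that wait-for cycle (class and latch names are invented for illustration; this is not Kafka code): the "expiry handler" awaits a latch that only the "controller" counts down, while the "controller" is itself parked on a latch that can never be counted down.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchDeadlockDemo {

    // Returns true only if the "Expire" event was ever processed.
    static boolean runDemo() throws InterruptedException {
        CountDownLatch expireStarted = new CountDownLatch(1); // what the expiry handler awaits
        CountDownLatch zkResponse = new CountDownLatch(1);    // what the controller awaits

        // "zk-session-expiry-handler0": blocked until the controller starts
        // processing the Expire event (cf. Expire.waitUntilProcessingStarted).
        Thread expiryHandler = new Thread(() -> {
            try {
                expireStarted.await();
            } catch (InterruptedException ignored) { }
        }, "zk-session-expiry-handler0");

        // "controller-event-thread": blocked in handleRequests waiting for a ZK
        // response that cannot arrive until the session is reinitialized -- which
        // is exactly what the expiry handler is stuck trying to do.
        Thread controller = new Thread(() -> {
            try {
                zkResponse.await();
                expireStarted.countDown(); // never reached in this demo
            } catch (InterruptedException ignored) { }
        }, "controller-event-thread");

        expiryHandler.start();
        controller.start();

        // Neither latch is ever counted down, so both threads stay in
        // WAITING (parking), matching the jstack output; we bound the wait
        // here only so the demo terminates.
        boolean progressed = expireStarted.await(300, TimeUnit.MILLISECONDS);

        expiryHandler.interrupt();
        controller.interrupt();
        expiryHandler.join();
        controller.join();
        return progressed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("expire processed: " + runDemo()); // prints "expire processed: false"
    }
}
```

Breaking such a cycle requires one side to stop waiting (a timeout or interrupt), which is consistent with the reporter's observation that the only recovery was restarting the broker.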
[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

leibo updated KAFKA-8532:
-------------------------
    Attachment: js0.log
[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

leibo updated KAFKA-8532:
-------------------------
    Attachment: js1.log
[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865266#comment-16865266 ]

leibo edited comment on KAFKA-8532 at 6/17/19 11:15 AM:
--------------------------------------------------------

[~ijuma] I have tried to reproduce this but failed; it reappears a few days after I restart the Kafka cluster. The following information may be helpful:
# We run the Kafka cluster in Docker containers managed by Kubernetes.
# So far, every time this issue has occurred, the ZooKeeper container was limited to 1 CPU core while the host has more than 50 cores.
# With the ZooKeeper container limited to 1 core on such a host, the JVM sizes its GC thread pool from the host CPU count rather than the container limit, so ZooKeeper inside the container ends up with more than 38 GC worker threads (per the JVM documentation, 1 core should correspond to about 2 GC worker threads). These GC threads may slow ZooKeeper down and may cause the Kafka-ZooKeeper session to time out.
# After we set *-XX:ParallelGCThreads=4* on the ZooKeeper JVM, the issue has not recurred, but I think this tuning only reduces the probability of the problem rather than solving it.

Based on the above, I think that when this problem occurred, ZooKeeper was not running well and the Kafka cluster deadlocked; after some time ZooKeeper recovered, but Kafka did not.
[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865266#comment-16865266 ]

leibo edited comment on KAFKA-8532 at 6/17/19 2:53 AM:
-------------------------------------------------------
[jira] [Commented] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865266#comment-16865266 ]

leibo commented on KAFKA-8532:
------------------------------
[jira] [Commented] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863673#comment-16863673 ] leibo commented on KAFKA-8532: -- [~junrao] Thank you for your reply. Because we need to recovery the kafka cluster ,so we have restart the kafka server. Follows is my today's analysis: I noticed when expiryScheduler connect zookeeper server failed, it will throw exception immediately as follow code: {code:java} //代码占位符 expiryScheduler.startup() try waitUntilConnected(connectionTimeoutMs, TimeUnit.MILLISECONDS) catch { case e: Throwable => close() throw e }{code} close() {code:java} //代码占位符 def close(): Unit = { info("Closing.") inWriteLock(initializationLock) { zNodeChangeHandlers.clear() zNodeChildChangeHandlers.clear() stateChangeHandlers.clear() zooKeeper.close() metricNames.foreach(removeMetric(_)) } // Shutdown scheduler outside of lock to avoid deadlock if scheduler // is waiting for lock to process session expiry expiryScheduler.shutdown() info("Closed.") } {code} expiryScheduler.shutdown() {code:java} //代码占位符 override def shutdown() { debug("Shutting down task scheduler.") // We use the local variable to avoid NullPointerException if another thread shuts down scheduler at same time. 
val cachedExecutor = this.executor if (cachedExecutor != null) { this synchronized { cachedExecutor.shutdown() this.executor = null } cachedExecutor.awaitTermination(1, TimeUnit.DAYS) } } {code} {color:#ff}we see expiryScheduler.shutdown() may execute timeout *1 day* , it too long.{color} {color:#33}I remember that there are no zookeeper connection or heartbeat log in the server.log in 1 day when the issue occurred.{color} {color:#33}I wonder if shutdown() continued for 1 day, the zookeeper instance will not be {color}reinitialize ?{color:#33} {color} > controller-event-thread deadlock with zk-session-expiry-handler0 > > > Key: KAFKA-8532 > URL: https://issues.apache.org/jira/browse/KAFKA-8532 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.1.1 >Reporter: leibo >Priority: Blocker > Attachments: js.log > > > We have observed a serious deadlock between controller-event-thead and > zk-session-expirey-handle thread. When this issue occurred, it's only one way > to recovery the kafka cluster is restart kafka server. The follows is the > jstack log of controller-event-thead and zk-session-expiry-handle thread. 
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x0005ee3f7000> (a java.util.concurrent.CountDownLatch$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // waiting for controller-event-thread to process the expire event
> at kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
> at kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
> at kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
> at kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
> at kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
> at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown Source)
> at scala.collection.Iterator.foreach(Iterator.scala:937)
> at scala.collection.Iterator.foreach$(Iterator.scala:937)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
> at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
> at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
> at kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
> at kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown Source)
> at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
> at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown Source)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63
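The expiryScheduler.shutdown() concern above can be illustrated with a minimal, self-contained sketch (plain java.util.concurrent, not Kafka code; the class and method names are illustrative): shutdown() does not interrupt an already-running task, and awaitTermination() blocks the calling thread until the task finishes or the timeout elapses. With a 1-day timeout, a stuck session-expiry task can therefore pin the closing thread for up to a day.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ShutdownDemo {
    // Submits a task that blocks forever, then shuts down and waits.
    // Returns whether the executor terminated within the timeout.
    static boolean shutdownAndWait(long timeoutMs) {
        ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(1);
        CountDownLatch blocker = new CountDownLatch(1);
        // Simulate a session-expiry task that is stuck waiting on another thread.
        executor.execute(() -> {
            try { blocker.await(); } catch (InterruptedException ignored) { }
        });
        executor.shutdown(); // does NOT interrupt the running task
        boolean done;
        try {
            // Blocks the caller until the stuck task completes or the timeout elapses.
            done = executor.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            done = false;
        }
        blocker.countDown(); // release the task so the JVM can exit
        return done;
    }

    public static void main(String[] args) {
        System.out.println("terminated: " + shutdownAndWait(200)); // prints "terminated: false"
    }
}
```

With awaitTermination(1, TimeUnit.DAYS), the same call simply returns false a day later instead of 200 ms later; the caller is parked the whole time.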
[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862668#comment-16862668 ] leibo edited comment on KAFKA-8532 at 6/14/19 2:59 AM:
---
Here is some analysis of mine based on the jstack log and the source code. When the deadlock occurred:
# The zk-session-expiry-handler0 thread was reinitializing a new ZooKeeper instance; the expire event had been put on the ControllerEventManager queue and was waiting to be handled by the controller-event-thread. (*At this time, Kafka was still disconnected from ZooKeeper.*)
# Meanwhile, the controller-event-thread was handling RegisterBrokerAndReelect (a type of ControllerEvent), as follows:
{code:java}
// KafkaController.scala
case object RegisterBrokerAndReelect extends ControllerEvent {
  override def state: ControllerState = ControllerState.ControllerChange

  override def process(): Unit = {
    zkClient.registerBroker(brokerInfo)
    Reelect.process()
  }
}
{code}
In this process, it blocked in ZooKeeperClient.handleRequests, whose code is as follows:
{code:java}
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] = {
  if (requests.isEmpty)
    Seq.empty
  else {
    val countDownLatch = new CountDownLatch(requests.size)
    val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

    requests.foreach { request =>
      inFlightRequests.acquire()
      try {
        inReadLock(initializationLock) {
          send(request) { response =>
            responseQueue.add(response)
            inFlightRequests.release()
            countDownLatch.countDown()
          }
        }
      } catch {
        case e: Throwable =>
          inFlightRequests.release()
          throw e
      }
    }
    countDownLatch.await()
    responseQueue.asScala.toBuffer
  }
}
{code}
At this time, the ZooKeeper server was running normally and all of its functions were working well.
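The handleRequests pattern quoted above can be modeled with a short, self-contained sketch (plain java.util.concurrent, not Kafka code; names like sendAll are illustrative): one CountDownLatch is created per batch, each asynchronous callback counts it down, and the caller parks in await() until every response has arrived. If the callbacks never fire, for example because the session is dead, the caller waits forever, which is exactly the parked frame at CountDownLatch.await(CountDownLatch.java:231) in the jstack.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncBatchDemo {
    // Sends a batch of "requests" asynchronously and blocks until all responses arrive.
    static List<String> sendAll(List<String> requests) {
        ExecutorService io = Executors.newSingleThreadExecutor(); // stands in for the ZK IO thread
        CountDownLatch latch = new CountDownLatch(requests.size());
        Queue<String> responses = new ConcurrentLinkedQueue<>();
        for (String req : requests) {
            io.execute(() -> {                 // async "send"; the callback runs on the IO thread
                responses.add(req + "-ok");
                latch.countDown();             // one count per response
            });
        }
        try {
            latch.await();                     // caller parks here until ALL callbacks fire
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        io.shutdown();
        return new ArrayList<>(responses);
    }

    public static void main(String[] args) {
        System.out.println(sendAll(List.of("a", "b"))); // prints [a-ok, b-ok]
    }
}
```

The controller-event-thread in the jstack is stuck on exactly this kind of latch: its requests were handed to a ZooKeeper session whose callbacks will never run.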
I suspect the key point is the *initializationLock*: if inReadLock had succeeded, an exception would have been thrown while Kafka was disconnected from ZooKeeper, but we never see any exception in the Kafka server.log. The controller-event-thread may be unable to enter inReadLock(initializationLock) because the write lock was never released. There are two ways to enter inWriteLock(initializationLock):
1. close():
{code:java}
expiryScheduler.startup()
try waitUntilConnected(connectionTimeoutMs, TimeUnit.MILLISECONDS)
catch {
  case e: Throwable =>
    close() // close() here acquires the write lock
    throw e
}

def close(): Unit = {
  info("Closing.")
  inWriteLock(initializationLock) {
    zNodeChangeHandlers.clear()
    zNodeChildChangeHandlers.clear()
    stateChangeHandlers.clear()
    zooKeeper.close()
    metricNames.foreach(removeMetric(_))
  }
  // Shutdown scheduler outside of lock to avoid deadlock if scheduler
  // is waiting for lock to process session expiry
  expiryScheduler.shutdown()
  info("Closed.")
}
{code}
2. reinitialize():
{code:java}
private def reinitialize(): Unit = {
  // Initialization callbacks are invoked outside of the lock to avoid deadlock potential since their completion
  // may require additional Zookeeper requests, which will block to acquire the initialization lock
  stateChangeHandlers.values.foreach(callBeforeInitializingSession _) // zk-session-expiry-handler0 is blocked here
  inWriteLock(initializationLock) { // the program never reaches this point
    if (!connectionState.isAlive) {
      zooKeeper.close()
      info(s"Initializing a new session to $connectString.")
      // retry forever until ZooKeeper can be instantiated
      var connected = false
      while (!connected) {
        try {
          zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, ZooKeeperClientWatcher)
          connected = true
        } catch {
          case e: Exception =>
            info("Error when recreating ZooKeeper, retrying after a short sleep", e)
            Thread.sleep(1000)
        }
      }
    }
  }
  stateChangeHandlers.values.foreach(callAfterInitializingSession _)
}
{code}
So I think the problem occurred in the close() function.

was (Author: 
lbdai3190): Here is some analysis of mine based on the jstack log and the source code. When the deadlock occurred:
# The zk-session-expiry-handler0 thread was reinitializing a new ZooKeeper instance; the expire event had been put on the ControllerEventManager queue and was waiting to be handled by the controller-event-thread. (*At this time, Kafka was still disconnected from ZooKeeper.*)
# Meanwhile, the controller-event-thread was handling Startup (a type of ControllerEvent), as follows:
{code:java}
// KafkaController.scala
case object Startup extends ControllerEvent {
  def state = ControllerState.ControllerChange

  override def process(): Unit = {
    zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
    elect()
  }
[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862772#comment-16862772 ] leibo edited comment on KAFKA-8532 at 6/13/19 12:20 PM:
[~te...@apache.org] I saw you have handled KAFKA-6879. I hope you can help me analyze this problem, thank you.

was (Author: lbdai3190): [~te...@apache.org] As you have handled KAFKA-6879, I hope you can help me analyze this problem, thank you.

> controller-event-thread deadlock with zk-session-expiry-handler0
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 2.1.1
> Reporter: leibo
> Priority: Blocker
> Attachments: js.log
>
> We have observed a serious deadlock between the controller-event-thread and the zk-session-expiry-handler0 thread. When this issue occurs, the only way to recover the Kafka cluster is to restart the Kafka server. The following is the jstack log of the controller-event-thread and zk-session-expiry-handler0 thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x0005ee3f7000> (a java.util.concurrent.CountDownLatch$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // waiting for controller-event-thread to process the expire event
> at kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
> at kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
> at kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
> at kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
> at kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
> at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown Source)
> at scala.collection.Iterator.foreach(Iterator.scala:937)
> at scala.collection.Iterator.foreach$(Iterator.scala:937)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
> at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
> at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
> at kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
> at kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown Source)
> at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
> at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown Source)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
> - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
>
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 nid=0x310 waiting on condition [0x7fccb55c8000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x0005d1be5a00> (a java.util.concurrent.CountDownLatch$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterrup
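The role that initializationLock plays in the analysis above can be demonstrated with a minimal, self-contained sketch (plain java.util.concurrent, not Kafka code; tryReadWhileWriterHolds is an illustrative name): while one thread holds the write lock of a ReentrantReadWriteLock, as close() and reinitialize() do, no other thread can acquire the read lock, which is what handleRequests needs inside inReadLock(initializationLock).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class InitLockDemo {
    // Holds the write lock on one thread, then tries to take the read lock
    // on the caller's thread. Returns whether the read lock was acquired.
    static boolean tryReadWhileWriterHolds(long timeoutMs) {
        ReentrantReadWriteLock initializationLock = new ReentrantReadWriteLock();
        CountDownLatch writerHasLock = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread closer = new Thread(() -> {
            initializationLock.writeLock().lock(); // as in close() / reinitialize()
            writerHasLock.countDown();
            try { release.await(); } catch (InterruptedException ignored) { }
            initializationLock.writeLock().unlock();
        });
        closer.start();
        boolean got = false;
        try {
            writerHasLock.await();
            // handleRequests' inReadLock(initializationLock) cannot proceed
            // while the write lock is held by another thread:
            got = initializationLock.readLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException ignored) { }
        release.countDown();
        try { closer.join(); } catch (InterruptedException ignored) { }
        if (got) initializationLock.readLock().unlock();
        return got;
    }

    public static void main(String[] args) {
        System.out.println("read lock acquired: " + tryReadWhileWriterHolds(100)); // prints "read lock acquired: false"
    }
}
```

Note that in the real ZooKeeperClient the reader blocks indefinitely rather than timing out, which matches the hang described in the report.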
[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leibo updated KAFKA-8532:
-
Description: We have observed a serious deadlock between the controller-event-thread and the zk-session-expiry-handler0 thread. When this issue occurs, the only way to recover the Kafka cluster is to restart the Kafka server. The following is the jstack log of the controller-event-thread and zk-session-expiry-handler0 thread.

"zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0005ee3f7000> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // waiting for controller-event-thread to process the expire event
at kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
at kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
at kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
at kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
at kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown Source)
at scala.collection.Iterator.foreach(Iterator.scala:937)
at scala.collection.Iterator.foreach$(Iterator.scala:937)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
at kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
at kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown Source)
at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown Source)
at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
- <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 nid=0x310 waiting on condition [0x7fccb55c8000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0005d1be5a00> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157)
at kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1596)
at kafka.zk.KafkaZkClient.kafka$zk$KafkaZkClient$$retryRequestUntilConnected(KafkaZkClient.scala:1589)
at kafka.zk.KafkaZkClient.deletePreferredReplicaElection(KafkaZkClient.scala:989)
at kafka.controller.KafkaController.removePartitionsFromPreferredReplicaElection(KafkaController.scala:873)
at kafka.controller.KafkaController.kafka$controller$KafkaController$$onPreferredReplicaElection(KafkaController.scala:631)
at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:266)
at kafka.controller.KafkaController.kafka$controller$KafkaController$$elect(KafkaController.scala:1221)
at kafka.controller.KafkaController$Reelect$.pr
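The two stacks above show both threads parked in CountDownLatch.await(), forming a cycle: the expiry handler waits for the controller to start processing the Expire event, while the controller waits for ZooKeeper responses that cannot arrive until the session is reinitialized. A simplified, self-contained model of that cycle (plain Java threads and latches; the thread names are borrowed from the report, the latch names are illustrative, and this is not Kafka's actual code):

```java
import java.util.concurrent.CountDownLatch;

public class LatchDeadlockDemo {
    // Each thread waits on a latch that only the other thread would count down,
    // so neither can make progress. Returns whether both ended up WAITING.
    static boolean bothWaiting() {
        CountDownLatch expireEventProcessed = new CountDownLatch(1); // counted down by "controller"
        CountDownLatch zkResponseReceived = new CountDownLatch(1);   // counted down after reinit

        Thread expiryHandler = new Thread(() -> {
            try {
                // beforeInitializingSession: wait for the controller to pick up the Expire event
                expireEventProcessed.await();
                zkResponseReceived.countDown(); // would only happen once a new session exists
            } catch (InterruptedException ignored) { }
        }, "zk-session-expiry-handler0");

        Thread controller = new Thread(() -> {
            try {
                // handleRequests: wait for a ZooKeeper response that can never arrive
                zkResponseReceived.await();
                expireEventProcessed.countDown();
            } catch (InterruptedException ignored) { }
        }, "controller-event-thread");

        expiryHandler.start();
        controller.start();
        try { Thread.sleep(300); } catch (InterruptedException ignored) { }
        boolean waiting = expiryHandler.getState() == Thread.State.WAITING
                && controller.getState() == Thread.State.WAITING;
        // Break the cycle so the demo can exit (a real deadlock has no such escape hatch).
        expiryHandler.interrupt();
        controller.interrupt();
        try { expiryHandler.join(); controller.join(); } catch (InterruptedException ignored) { }
        return waiting;
    }

    public static void main(String[] args) {
        System.out.println("both WAITING: " + bothWaiting());
    }
}
```

A jstack of this demo during the sleep shows both threads in WAITING (parking) on a CountDownLatch$Sync, the same signature as the report.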
[jira] [Commented] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862772#comment-16862772 ] leibo commented on KAFKA-8532:
--
[~te...@apache.org] As you have handled KAFKA-6879, I hope you can help me analyze this problem, thank you.

> controller-event-thread deadlock with zk-session-expiry-handler0
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 2.1.1
> Reporter: leibo
> Priority: Blocker
> Attachments: js.log
>
> We have observed a serious deadlock between the controller-event-thread and the zk-session-expiry-handler0 thread. When this issue occurs, the only way to recover the Kafka cluster is to restart the Kafka server. The following is the jstack log of the controller-event-thread and zk-session-expiry-handler0 thread.
[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862668#comment-16862668 ] leibo edited comment on KAFKA-8532 at 6/13/19 6:45 AM: --- Here is some analysis of mine according to the jstack log and source code: When the deadlock occurred, # The zk-session-expiry-handler0 thread is reinitializing a new zookeeper instance, the expireEvent was put to queue of ControllerEventManager and waitting to handle by controller-event-handle thread. (*In the time, kafka is disconnected with zookeeper yet*). # However , in the same time, the controller-event-thread is handling Startup(type of ControllerEvent), as bellow: {code:java} //KafkaController.scala case object Startup extends ControllerEvent { def state = ControllerState.ControllerChange override def process(): Unit = { zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler) elect() } } {code} In this process, it is blocked at ZookeeperClient.handleRequests, code as bellow: {code:java} //代码占位符 def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] = { if (requests.isEmpty) Seq.empty else { val countDownLatch = new CountDownLatch(requests.size) val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size) requests.foreach { request => inFlightRequests.acquire() try { inReadLock(initializationLock) { send(request) { response => responseQueue.add(response) inFlightRequests.release() countDownLatch.countDown() } } } catch { case e: Throwable => inFlightRequests.release() throw e } } countDownLatch.await() responseQueue.asScala.toBuffer } } {code} At this time, Zookeeper server is running normally, and all function is working well. 
I suspect the key point is the *initializationLock*: if inReadLock had succeeded, an exception would have been thrown once kafka was disconnected from zookeeper, but we never see any such exception in the kafka server.log. So the controller-event-thread probably cannot enter inReadLock(initializationLock), perhaps because the write lock was never released. There are two places that take inWriteLock(initializationLock):
1. close():
{code:java}
expiryScheduler.startup()
try waitUntilConnected(connectionTimeoutMs, TimeUnit.MILLISECONDS)
catch {
  case e: Throwable =>
    close() // the write lock is acquired inside close() here
    throw e
}

def close(): Unit = {
  info("Closing.")
  inWriteLock(initializationLock) {
    zNodeChangeHandlers.clear()
    zNodeChildChangeHandlers.clear()
    stateChangeHandlers.clear()
    zooKeeper.close()
    metricNames.foreach(removeMetric(_))
  }
  // Shutdown scheduler outside of lock to avoid deadlock if scheduler
  // is waiting for lock to process session expiry
  expiryScheduler.shutdown()
  info("Closed.")
}
{code}
2. reinitialize():
{code:java}
private def reinitialize(): Unit = {
  // Initialization callbacks are invoked outside of the lock to avoid deadlock potential since their completion
  // may require additional Zookeeper requests, which will block to acquire the initialization lock
  stateChangeHandlers.values.foreach(callBeforeInitializingSession _) // zk-session-expiry-handler0 is blocked here
  inWriteLock(initializationLock) { // the program never reaches this point
    if (!connectionState.isAlive) {
      zooKeeper.close()
      info(s"Initializing a new session to $connectString.")
      // retry forever until ZooKeeper can be instantiated
      var connected = false
      while (!connected) {
        try {
          zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, ZooKeeperClientWatcher)
          connected = true
        } catch {
          case e: Exception =>
            info("Error when recreating ZooKeeper, retrying after a short sleep", e)
            Thread.sleep(1000)
        }
      }
    }
  }
  stateChangeHandlers.values.foreach(callAfterInitializingSession _)
}
{code}
So I think the problem occurs in the close() path.
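The suspicion above is that a writer holding initializationLock keeps the controller-event-thread from ever acquiring the read side. A minimal, hypothetical Java sketch (not Kafka code) of that blocking behavior, assuming a plain ReentrantReadWriteLock like the one backing inReadLock/inWriteLock, using a timed tryLock so the starvation is observable:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class Main {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        // Model close() (or reinitialize()) holding the write lock and never releasing it.
        lock.writeLock().lock();

        // Model the controller-event-thread attempting inReadLock(initializationLock)
        // from handleRequests; a plain lock() would park forever here.
        Thread controllerEventThread = new Thread(() -> {
            try {
                boolean acquired = lock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
                System.out.println(acquired); // prints "false"
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        controllerEventThread.start();
        controllerEventThread.join();
    }
}
```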
[jira] [Commented] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862668#comment-16862668 ] leibo commented on KAFKA-8532: -- Here is some analysis of mine, based on the jstack log and the source code. When the deadlock occurred:
# The zk-session-expiry-handler0 thread was reinitializing a new zookeeper instance; the Expire event had been put on the queue of ControllerEventManager and was waiting to be handled by the controller-event-thread. (*At that point, kafka was already disconnected from zookeeper.*)
# However, at the same time, the controller-event-thread was handling Startup (a type of ControllerEvent) and was blocked at ZookeeperClient.handleRequests, in the code quoted in the edited comment above.
> controller-event-thread deadlock with zk-session-expiry-handler0
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 2.1.1
> Reporter: leibo
> Priority: Blocker
> Attachments: js.log
>
> We have observed a serious deadlock between the controller-event-thread and the zk-session-expiry-handler0 thread. When this issue occurs, the only way to recover the kafka cluster is to restart the kafka server. Below is the jstack log of the controller-event-thread and the zk-session-expiry-handler0 thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x0005ee3f7000> (a java.util.concurrent.CountDownLatch$Sync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // waiting for the controller-event-thread to process the expireEvent
> at kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
> at kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
> at kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
> at kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
> at kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
> at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown Source)
> at scala.collection.Iterator.foreach(Iterator.scala:937)
> at scala.collection.Iterator.foreach$(Iterator.scala:937)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
> at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
> at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
> at kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
> at kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown Source)
> at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
> at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown Source)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:
[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leibo updated KAFKA-8532: - Attachment: js.log > controller-event-thread deadlock with zk-session-expiry-handler0 > > > Key: KAFKA-8532 > URL: https://issues.apache.org/jira/browse/KAFKA-8532 > Project: Kafka > Issue Type: Bug > Components: core >Affects Versions: 2.1.1 >Reporter: leibo >Priority: Blocker > Attachments: js.log
[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
[ https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leibo updated KAFKA-8532: - Description: We have observed a serious deadlock between the controller-event-thread and zk-session-expiry-handler thread. When this issue occurred, the only way to recover the Kafka cluster was to restart the Kafka server. Below are the jstack logs of the controller-event-thread and zk-session-expiry-handler thread. "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0005ee3f7000> (a java.util.concurrent.CountDownLatch$Sync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // waiting for the controller-event-thread to process the Expire event at kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533) at kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173) at kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408) at kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374) at kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374) at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown Source) at scala.collection.Iterator.foreach(Iterator.scala:937) at scala.collection.Iterator.foreach$(Iterator.scala:937) at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1425) at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209) at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374) at kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428) at kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown Source) at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114) at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown Source) at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Locked ownable synchronizers: - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker) "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 nid=0x310 waiting on condition [0x7fccb55c8000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0005d1be5a00> (a java.util.concurrent.CountDownLatch$Sync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157) at kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1596) at kafka.zk.KafkaZkClient.kafka$zk$KafkaZkClient$$retryRequestUntilConnected(KafkaZkClient.scala:1589) at kafka.zk.KafkaZkClient.deletePreferredReplicaElection(KafkaZkClient.scala:989) at kafka.controller.KafkaController.removePartitionsFromPreferredReplicaElection(KafkaController.scala:873) at kafka.controller.KafkaController.kafka$controller$KafkaController$$onPreferredReplicaElection(KafkaController.scala:631) at kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:266) at kafka.controller.KafkaController.kafka$controller$KafkaController$$elect(KafkaController.scala:1221) at kafka.controller.KafkaController$Reelect$.process(Kafk
[jira] [Created] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0
leibo created KAFKA-8532: Summary: controller-event-thread deadlock with zk-session-expiry-handler0 Key: KAFKA-8532 URL: https://issues.apache.org/jira/browse/KAFKA-8532 Project: Kafka Issue Type: Bug Components: core Affects Versions: 2.1.1 Reporter: leibo -- This message was sent by Atlassian JIRA (v7.6.3#76005)
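The two jstack traces in the description show a classic cross-latch wait: zk-session-expiry-handler0 blocks in waitUntilProcessingStarted until the controller-event-thread dequeues the Expire event, while the controller-event-thread blocks in handleRequests waiting for a ZooKeeper response that cannot arrive until the session is reinitialized. The following is a minimal, hypothetical Java sketch of that mutual CountDownLatch wait (class and latch names are illustrative, not Kafka's actual code; the demo uses timeouts only so it terminates):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchDeadlockDemo {
    // Each thread must count down the latch the OTHER thread is waiting on,
    // but only after its own await() returns -- so neither latch ever reaches zero.
    public static boolean bothTimedOut() throws InterruptedException {
        CountDownLatch expireProcessed = new CountDownLatch(1); // would be counted down by the controller
        CountDownLatch zkReconnected = new CountDownLatch(1);   // would be counted down by the expiry handler
        boolean[] timedOut = new boolean[2];

        Thread controller = new Thread(() -> {
            try {
                // stuck like handleRequests(): waits for the ZK reconnect before it
                // could ever dequeue and process the Expire event
                timedOut[0] = !zkReconnected.await(300, TimeUnit.MILLISECONDS);
                if (!timedOut[0]) expireProcessed.countDown();
            } catch (InterruptedException ignored) { }
        }, "controller-event-thread");

        Thread expiryHandler = new Thread(() -> {
            try {
                // stuck like waitUntilProcessingStarted(): waits for the Expire
                // event to start before it would reinitialize the ZK session
                timedOut[1] = !expireProcessed.await(300, TimeUnit.MILLISECONDS);
                if (!timedOut[1]) zkReconnected.countDown();
            } catch (InterruptedException ignored) { }
        }, "zk-session-expiry-handler0");

        controller.start();
        expiryHandler.start();
        controller.join();
        expiryHandler.join();
        return timedOut[0] && timedOut[1]; // true: neither latch was ever counted down
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(bothTimedOut()); // prints "true"
    }
}
```

With real (untimed) await() calls, as in the reported stacks, both threads park forever, which is why only a broker restart recovers the cluster.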
[jira] [Comment Edited] (KAFKA-5998) /.checkpoint.tmp Not found exception
[ https://issues.apache.org/jira/browse/KAFKA-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784126#comment-16784126 ] leibo edited comment on KAFKA-5998 at 3/5/19 6:57 AM: -- It seems that {color:#ff}final File temp = new File(file.getAbsolutePath() + ".tmp");{color} will not create a new file on some systems. Can we do the following before writing offsets to the checkpoint.tmp file? {code:java} // code placeholder final File temp = new File(file.getAbsolutePath() + ".tmp"); if (!temp.exists()) { // create the .checkpoint.tmp file if it does not exist temp.createNewFile(); } final FileOutputStream fileOutputStream = new FileOutputStream(temp); {code} was (Author: lbdai3190): It seems that {color:#FF}final File temp = new File(file.getAbsolutePath() + ".tmp");{color} will not create a new file on some systems. {code:java} // code placeholder final File temp = new File(file.getAbsolutePath() + ".tmp"); if (!temp.exists()) { // create the .checkpoint.tmp file if it does not exist temp.createNewFile(); } final FileOutputStream fileOutputStream = new FileOutputStream(temp); {code} > /.checkpoint.tmp Not found exception > > > Key: KAFKA-5998 > URL: https://issues.apache.org/jira/browse/KAFKA-5998 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 0.11.0.0, 0.11.0.1, 2.1.1 >Reporter: Yogesh BG >Priority: Major > Attachments: 5998.v1.txt, 5998.v2.txt, Topology.txt, exc.txt, > props.txt, streams.txt > > > I have one Kafka broker and one Kafka Streams instance running... I have been running them > for two days under a load of around 2500 msgs per second.
On the third day I am > getting the below exception for some of the partitions; I have 16 partitions and only > 0_0 and 0_1 give this error > {{09:43:25.955 [ks_0_inst-StreamThread-6] WARN > o.a.k.s.p.i.ProcessorStateManager - Failed to write checkpoint file to > /data/kstreams/rtp-kafkastreams/0_1/.checkpoint: > java.io.FileNotFoundException: > /data/kstreams/rtp-kafkastreams/0_1/.checkpoint.tmp (No such file or > directory) > at java.io.FileOutputStream.open(Native Method) ~[na:1.7.0_111] > at java.io.FileOutputStream.<init>(FileOutputStream.java:221) > ~[na:1.7.0_111] > at java.io.FileOutputStream.<init>(FileOutputStream.java:171) > ~[na:1.7.0_111] > at > org.apache.kafka.streams.state.internals.OffsetCheckpoint.write(OffsetCheckpoint.java:73) > ~[rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na] > at > org.apache.kafka.streams.processor.internals.ProcessorStateManager.checkpoint(ProcessorStateManager.java:324) > ~[rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na] > at > org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:267) > [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na] > at > org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:201) > [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na] > at > org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:260) > [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na] > at > org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:254) > [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na] > at > org.apache.kafka.streams.processor.internals.AssignedTasks$1.apply(AssignedTasks.java:322) > [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na] > at > org.apache.kafka.streams.processor.internals.AssignedTasks.applyToRunningTasks(AssignedTasks.java:415) > [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na] > at > 
org.apache.kafka.streams.processor.internals.AssignedTasks.commit(AssignedTasks.java:314) > [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na] > at > org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:700) > [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na] > at > org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:683) > [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na] > at > org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:523) > [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na] > at > org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:480) > [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na] > at > org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:457) > [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na] > 09:43:25.974 [ks_0_inst-StreamThread-15] WARN > o.a.k.s.p.i.ProcessorStateManager - Failed to wr
[jira] [Commented] (KAFKA-5998) /.checkpoint.tmp Not found exception
[ https://issues.apache.org/jira/browse/KAFKA-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784126#comment-16784126 ] leibo commented on KAFKA-5998: -- It seems that {color:#FF}final File temp = new File(file.getAbsolutePath() + ".tmp");{color} will not create a new file on some systems. {code:java} // code placeholder final File temp = new File(file.getAbsolutePath() + ".tmp"); if (!temp.exists()) { // create the .checkpoint.tmp file if it does not exist temp.createNewFile(); } final FileOutputStream fileOutputStream = new FileOutputStream(temp); {code} > /.checkpoint.tmp Not found exception > > > Key: KAFKA-5998 > URL: https://issues.apache.org/jira/browse/KAFKA-5998 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 0.11.0.0, 0.11.0.1, 2.1.1 >Reporter: Yogesh BG >Priority: Major > Attachments: 5998.v1.txt, 5998.v2.txt, Topology.txt, exc.txt, > props.txt, streams.txt
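One note on the proposed createNewFile() guard: in standard Java, new FileOutputStream(temp) already creates the target file when it is absent, so the FileNotFoundException in the stack traces above usually means the parent directory (here the task directory such as 0_1) is missing, in which case createNewFile() would fail the same way. A small self-contained sketch of this behavior (class and method names are illustrative, not from Kafka):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;

public class CheckpointTmpDemo {
    // Opens and writes the file the way a checkpoint writer would; returns true
    // on success, false when the stream cannot be opened (FileNotFoundException,
    // e.g. because the parent directory does not exist).
    public static boolean tryWrite(File target) {
        try (FileOutputStream out = new FileOutputStream(target)) {
            out.write('x');
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("ckpt-demo").toFile();
        // Parent directory exists: FileOutputStream creates .checkpoint.tmp itself.
        System.out.println(tryWrite(new File(dir, ".checkpoint.tmp")));      // true
        // Parent directory missing: FileOutputStream throws FileNotFoundException.
        System.out.println(tryWrite(new File(dir, "gone/.checkpoint.tmp"))); // false
    }
}
```

So a guard that (re)creates the parent directory, rather than the file itself, would be the more targeted defensive check for this symptom.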
[jira] [Updated] (KAFKA-7684) kafka consumer SchemaException occurred: Error reading field 'brokers':
[ https://issues.apache.org/jira/browse/KAFKA-7684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] leibo updated KAFKA-7684: - Description: The Kafka consumer hit org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'brokers': Error reading array of size 65535, only 22 bytes available when the broker was not working due to a *long full GC.* *Exception Detail:* org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'brokers': Error reading array of size 65535, only 22 bytes available at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:75) at org.apache.kafka.common.protocol.ApiKeys.parseResponse(ApiKeys.java:153) at org.apache.kafka.clients.NetworkClient.parseStructMaybeUpdateThrottleTimeMetrics(NetworkClient.java:560) at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:657) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:442) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:232) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:208) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:184) at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorReady(AbstractCoordinator.java:217) at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorReady(AbstractCoordinator.java:203) at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:286) at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1078) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043) at com.zte.ums.zenap.cometd.ReceiveKafkaMessage.run(ReceiveKafkaMessage.java:40) *After this, the following exception occurred:* 2018-11-21 19:43:52 097 ERROR
[org.apache.kafka.clients.consumer.KafkaConsumer][Thread-22] - Failed to close coordinator java.lang.IllegalStateException: Correlation id for response (3097236) does not match request (3097235), request header: \{api_key=8,api_version=3,correlation_id=3097235,client_id=pm-task_100.100.0.43_8} at org.apache.kafka.clients.NetworkClient.correlate(NetworkClient.java:752) at org.apache.kafka.clients.NetworkClient.parseStructMaybeUpdateThrottleTimeMetrics(NetworkClient.java:561) at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:657) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:442) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:232) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:208) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:199) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitPendingRequests(ConsumerNetworkClient.java:284) at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.close(AbstractCoordinator.java:706) at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.close(ConsumerCoordinator.java:503) at org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1605) at org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1573) at org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1549) at com.zte.ums.zenap.pm.common.kafka.BasicKafkaConsumerWorker$1.run(BasicKafkaConsumerWorker.java:77) at java.lang.Thread.run(Thread.java:811) 18-11-21 19:43:52 099 ERROR [com.zte.ums.zenap.dw.ext.FrameWorkThreadGroup][Thread-24] - Thread-24 throw this exception org.apache.kafka.common.KafkaException: Failed to close kafka consumer at org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1623) at 
org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1573) at org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1549) at com.zte.ums.zenap.pm.common.kafka.BasicKafkaConsumerWorker$1.run(BasicKafkaConsumerWorker.java:77) at java.lang.Thread.run(Thread.java:811) Kafka broker version: *0.11.0.1* Kafka consumer version: *0.11.0.1* *More Info:* At the same time, the Kafka broker had a long (about 16 seconds) full GC, and many consumer rebalances and ZooKeeper session timeouts occurred. was: The Kafka consumer hit org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'brokers': Error reading array of size 65535, only 22 bytes available when the broker was not working due to a long full GC. Exception Detail: org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'brokers': Error reading array of size 65535, only 22 bytes available at org.apache.kafka.commo
[jira] [Created] (KAFKA-7684) kafka consumer SchemaException occurred: Error reading field 'brokers':
leibo created KAFKA-7684: Summary: kafka consumer SchemaException occurred: Error reading field 'brokers': Key: KAFKA-7684 URL: https://issues.apache.org/jira/browse/KAFKA-7684 Project: Kafka Issue Type: Bug Components: consumer Affects Versions: 0.11.0.1 Reporter: leibo The Kafka consumer hit org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'brokers': Error reading array of size 65535, only 22 bytes available when the broker was not working due to a long full GC. Exception Detail: org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'brokers': Error reading array of size 65535, only 22 bytes available at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:75) at org.apache.kafka.common.protocol.ApiKeys.parseResponse(ApiKeys.java:153) at org.apache.kafka.clients.NetworkClient.parseStructMaybeUpdateThrottleTimeMetrics(NetworkClient.java:560) at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:657) at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:442) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:232) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:208) at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:184) at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorReady(AbstractCoordinator.java:217) at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorReady(AbstractCoordinator.java:203) at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:286) at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1078) at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043) at com.zte.ums.zenap.cometd.ReceiveKafkaMessage.run(ReceiveKafkaMessage.java:40) -- This message
was sent by Atlassian JIRA (v7.6.3#76005)
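The error pattern reported for KAFKA-7684 above ("Error reading array of size 65535, only 22 bytes available", followed by a correlation-id mismatch) is characteristic of a parser consuming a length prefix that does not match the bytes actually present, e.g. because the response stream is corrupted or out of sync after the broker's long GC pause. A hypothetical sketch of such a length-prefix sanity check (not Kafka's actual protocol code; names are illustrative):

```java
import java.nio.ByteBuffer;

public class TruncatedResponseDemo {
    // Reads a 4-byte element count and validates it against the bytes left in
    // the buffer, as a schema parser must do before allocating the array.
    public static int readArraySize(ByteBuffer buf) {
        int size = buf.getInt();
        if (size < 0 || size > buf.remaining()) {
            throw new IllegalStateException("Error reading array of size " + size
                    + ", only " + buf.remaining() + " bytes available");
        }
        return size;
    }

    public static void main(String[] args) {
        // A response claiming 65535 elements but carrying only 22 payload bytes.
        ByteBuffer truncated = ByteBuffer.allocate(4 + 22);
        truncated.putInt(65535);
        truncated.put(new byte[22]);
        truncated.flip();
        try {
            readArraySize(truncated);
        } catch (IllegalStateException e) {
            // prints: Error reading array of size 65535, only 22 bytes available
            System.out.println(e.getMessage());
        }
    }
}
```

Once the reader has consumed a garbled prefix like this, every subsequent frame boundary is wrong, which is consistent with the later "Correlation id for response (3097236) does not match request (3097235)" failure during close.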