[jira] [Updated] (KAFKA-9973) __consumer_offsets record is invalid lead to log clean failed and __consumer_offsets grows too big

2020-05-07 Thread leibo (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-9973:
-
Attachment: dump.png

> __consumer_offsets record is invalid lead to log clean failed and 
> __consumer_offsets grows too big
> --
>
> Key: KAFKA-9973
> URL: https://issues.apache.org/jira/browse/KAFKA-9973
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Major
> Attachments: dump.png, log-cleaner.log.1, screenshot-1.png
>
>
> __consumer_offsets-34 has grown to 13 GB and cannot be compacted successfully; the 
> error log is as below:
> {code:java}
> // code placeholder
> [2019-12-30 19:21:41,047] INFO [kafka-log-cleaner-thread-6]: Starting 
> (kafka.log.LogCleaner)
> [2019-12-30 19:21:41,079] INFO [kafka-log-cleaner-thread-7]: Starting 
> (kafka.log.LogCleaner)
> [2019-12-30 19:21:42,825] WARN [kafka-log-cleaner-thread-0]: Unexpected 
> exception thrown when cleaning log 
> Log(/home/kafka_2.12-2.1.1/data/kafka-logs/__consumer_offsets-34). Marking 
> its partition (__consumer_offsets-34) as uncleanable (kafka.log.LogCleaner)
> org.apache.kafka.common.record.InvalidRecordException: Found invalid number 
> of record headers -47
> --
> [2019-12-30 20:23:51,340] INFO [kafka-log-cleaner-thread-5]: Starting 
> (kafka.log.LogCleaner)
> [2019-12-30 20:23:51,361] INFO Cleaner 0: Building offset map for log 
> __consumer_offsets-34 for 52 segments in offset range [26842714, 73423730). 
> (kafka.log.LogCleaner)
> [2019-12-30 20:23:51,398] INFO [kafka-log-cleaner-thread-6]: Starting 
> (kafka.log.LogCleaner)
> [2019-12-30 20:23:51,493] INFO [kafka-log-cleaner-thread-7]: Starting 
> (kafka.log.LogCleaner)
> [2019-12-30 20:23:59,596] WARN [kafka-log-cleaner-thread-0]: Unexpected 
> exception thrown when cleaning log 
> Log(/home/kafka_2.12-2.1.1/data/kafka-logs/__consumer_offsets-34). Marking 
> its partition (__consumer_offsets-34) as uncleanable (kafka.log.LogCleaner)
> org.apache.kafka.common.record.InvalidRecordException: Found invalid number 
> of record headers -47
> {code}
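
(Editorial note, not part of the original report.) In the v2 record format the per-record header count is stored as a zig-zag varint, so a single corrupted byte at that position can decode to a negative value such as -47, which the record parser rejects and the cleaner then reports by marking the partition uncleanable. A minimal standalone sketch of that decoding, assuming plain Java only (this is not Kafka's own implementation):
{code:java}
import java.nio.ByteBuffer;

public class HeaderCountCheck {
    // Simplified zig-zag varint decoding, the encoding used for the per-record header count
    static int readVarint(ByteBuffer buf) {
        int value = 0;
        int shift = 0;
        int b;
        do {
            b = buf.get() & 0xff;
            value |= (b & 0x7f) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (value >>> 1) ^ -(value & 1); // undo zig-zag encoding
    }

    public static void main(String[] args) {
        // the single byte 0x5D (93) zig-zag-decodes to -47, so one bad byte at the
        // header-count position is enough to produce the value seen in the log above
        ByteBuffer corrupted = ByteBuffer.wrap(new byte[] { (byte) 0x5D });
        int numHeaders = readVarint(corrupted);
        if (numHeaders < 0) {
            System.out.println("Found invalid number of record headers " + numHeaders);
        }
    }
}
{code}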





[jira] [Updated] (KAFKA-9973) __consumer_offsets record is invalid lead to log clean failed and __consumer_offsets grows too big

2020-05-07 Thread leibo (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-9973:
-
Attachment: screenshot-1.png

> __consumer_offsets record is invalid lead to log clean failed and 
> __consumer_offsets grows too big
> --
>
> Key: KAFKA-9973
> URL: https://issues.apache.org/jira/browse/KAFKA-9973
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Major
> Attachments: log-cleaner.log.1, screenshot-1.png
>
>
> __consumer_offsets-34 has grown to 13 GB and cannot be compacted successfully; the 
> error log is as below:
> {code:java}
> // code placeholder
> [2019-12-30 19:21:41,047] INFO [kafka-log-cleaner-thread-6]: Starting 
> (kafka.log.LogCleaner)
> [2019-12-30 19:21:41,079] INFO [kafka-log-cleaner-thread-7]: Starting 
> (kafka.log.LogCleaner)
> [2019-12-30 19:21:42,825] WARN [kafka-log-cleaner-thread-0]: Unexpected 
> exception thrown when cleaning log 
> Log(/home/kafka_2.12-2.1.1/data/kafka-logs/__consumer_offsets-34). Marking 
> its partition (__consumer_offsets-34) as uncleanable (kafka.log.LogCleaner)
> org.apache.kafka.common.record.InvalidRecordException: Found invalid number 
> of record headers -47
> --
> [2019-12-30 20:23:51,340] INFO [kafka-log-cleaner-thread-5]: Starting 
> (kafka.log.LogCleaner)
> [2019-12-30 20:23:51,361] INFO Cleaner 0: Building offset map for log 
> __consumer_offsets-34 for 52 segments in offset range [26842714, 73423730). 
> (kafka.log.LogCleaner)
> [2019-12-30 20:23:51,398] INFO [kafka-log-cleaner-thread-6]: Starting 
> (kafka.log.LogCleaner)
> [2019-12-30 20:23:51,493] INFO [kafka-log-cleaner-thread-7]: Starting 
> (kafka.log.LogCleaner)
> [2019-12-30 20:23:59,596] WARN [kafka-log-cleaner-thread-0]: Unexpected 
> exception thrown when cleaning log 
> Log(/home/kafka_2.12-2.1.1/data/kafka-logs/__consumer_offsets-34). Marking 
> its partition (__consumer_offsets-34) as uncleanable (kafka.log.LogCleaner)
> org.apache.kafka.common.record.InvalidRecordException: Found invalid number 
> of record headers -47
> {code}





[jira] [Updated] (KAFKA-9973) __consumer_offsets record is invalid lead to log clean failed and __consumer_offsets grows too big

2020-05-07 Thread leibo (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-9973:
-
Description: 
__consumer_offsets-34 has grown to 13 GB and cannot be compacted successfully; the 
error log is as below:
{code:java}
// code placeholder
[2019-12-30 19:21:41,047] INFO [kafka-log-cleaner-thread-6]: Starting 
(kafka.log.LogCleaner)
[2019-12-30 19:21:41,079] INFO [kafka-log-cleaner-thread-7]: Starting 
(kafka.log.LogCleaner)
[2019-12-30 19:21:42,825] WARN [kafka-log-cleaner-thread-0]: Unexpected 
exception thrown when cleaning log 
Log(/home/kafka_2.12-2.1.1/data/kafka-logs/__consumer_offsets-34). Marking its 
partition (__consumer_offsets-34) as uncleanable (kafka.log.LogCleaner)
org.apache.kafka.common.record.InvalidRecordException: Found invalid number of 
record headers -47
--
[2019-12-30 20:23:51,340] INFO [kafka-log-cleaner-thread-5]: Starting 
(kafka.log.LogCleaner)
[2019-12-30 20:23:51,361] INFO Cleaner 0: Building offset map for log 
__consumer_offsets-34 for 52 segments in offset range [26842714, 73423730). 
(kafka.log.LogCleaner)
[2019-12-30 20:23:51,398] INFO [kafka-log-cleaner-thread-6]: Starting 
(kafka.log.LogCleaner)
[2019-12-30 20:23:51,493] INFO [kafka-log-cleaner-thread-7]: Starting 
(kafka.log.LogCleaner)
[2019-12-30 20:23:59,596] WARN [kafka-log-cleaner-thread-0]: Unexpected 
exception thrown when cleaning log 
Log(/home/kafka_2.12-2.1.1/data/kafka-logs/__consumer_offsets-34). Marking its 
partition (__consumer_offsets-34) as uncleanable (kafka.log.LogCleaner)
org.apache.kafka.common.record.InvalidRecordException: Found invalid number of 
record headers -47

{code}

  was:
{code:java}
// code placeholder
[2019-12-30 19:21:41,047] INFO [kafka-log-cleaner-thread-6]: Starting 
(kafka.log.LogCleaner)
[2019-12-30 19:21:41,079] INFO [kafka-log-cleaner-thread-7]: Starting 
(kafka.log.LogCleaner)
[2019-12-30 19:21:42,825] WARN [kafka-log-cleaner-thread-0]: Unexpected 
exception thrown when cleaning log 
Log(/home/kafka_2.12-2.1.1/data/kafka-logs/__consumer_offsets-34). Marking its 
partition (__consumer_offsets-34) as uncleanable (kafka.log.LogCleaner)
org.apache.kafka.common.record.InvalidRecordException: Found invalid number of 
record headers -47
--
[2019-12-30 20:23:51,340] INFO [kafka-log-cleaner-thread-5]: Starting 
(kafka.log.LogCleaner)
[2019-12-30 20:23:51,361] INFO Cleaner 0: Building offset map for log 
__consumer_offsets-34 for 52 segments in offset range [26842714, 73423730). 
(kafka.log.LogCleaner)
[2019-12-30 20:23:51,398] INFO [kafka-log-cleaner-thread-6]: Starting 
(kafka.log.LogCleaner)
[2019-12-30 20:23:51,493] INFO [kafka-log-cleaner-thread-7]: Starting 
(kafka.log.LogCleaner)
[2019-12-30 20:23:59,596] WARN [kafka-log-cleaner-thread-0]: Unexpected 
exception thrown when cleaning log 
Log(/home/kafka_2.12-2.1.1/data/kafka-logs/__consumer_offsets-34). Marking its 
partition (__consumer_offsets-34) as uncleanable (kafka.log.LogCleaner)
org.apache.kafka.common.record.InvalidRecordException: Found invalid number of 
record headers -47

{code}


> __consumer_offsets record is invalid lead to log clean failed and 
> __consumer_offsets grows too big
> --
>
> Key: KAFKA-9973
> URL: https://issues.apache.org/jira/browse/KAFKA-9973
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Major
> Attachments: log-cleaner.log.1
>
>
> __consumer_offsets-34 has grown to 13 GB and cannot be compacted successfully; the 
> error log is as below:
> {code:java}
> // code placeholder
> [2019-12-30 19:21:41,047] INFO [kafka-log-cleaner-thread-6]: Starting 
> (kafka.log.LogCleaner)
> [2019-12-30 19:21:41,079] INFO [kafka-log-cleaner-thread-7]: Starting 
> (kafka.log.LogCleaner)
> [2019-12-30 19:21:42,825] WARN [kafka-log-cleaner-thread-0]: Unexpected 
> exception thrown when cleaning log 
> Log(/home/kafka_2.12-2.1.1/data/kafka-logs/__consumer_offsets-34). Marking 
> its partition (__consumer_offsets-34) as uncleanable (kafka.log.LogCleaner)
> org.apache.kafka.common.record.InvalidRecordException: Found invalid number 
> of record headers -47
> --
> [2019-12-30 20:23:51,340] INFO [kafka-log-cleaner-thread-5]: Starting 
> (kafka.log.LogCleaner)
> [2019-12-30 20:23:51,361] INFO Cleaner 0: Building offset map for log 
> __consumer_offsets-34 for 52 segments in offset range [26842714, 73423730). 
> (kafka.log.LogCleaner)
> [2019-12-30 20:23:51,398] INFO [kafka-log-cleaner-thread-6]: Starting 
> (kafka.log.LogCleaner)
> [2019-12-30 20:23:51,493] INFO [kafka-log-cleaner-thread-7]: Starting 
> (kafka.log.LogCleaner)
> [2019-12-30 20:23:59,596] WARN [kafka-log-cleaner-thread-0]: Unexpected 
> exception thrown when cleaning log 
> Log(/home/kafka_2.12-2.1.1/data/kafka-logs/__consumer_offsets-34). Marking 
> its partition (__

[jira] [Updated] (KAFKA-9973) __consumer_offsets record is invalid lead to log clean failed and __consumer_offsets grows too big

2020-05-07 Thread leibo (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-9973:
-
Attachment: log-cleaner.log.1

> __consumer_offsets record is invalid lead to log clean failed and 
> __consumer_offsets grows too big
> --
>
> Key: KAFKA-9973
> URL: https://issues.apache.org/jira/browse/KAFKA-9973
> Project: Kafka
>  Issue Type: Bug
>  Components: log cleaner
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Major
> Attachments: log-cleaner.log.1
>
>
> {code:java}
> // code placeholder
> [2019-12-30 19:21:41,047] INFO [kafka-log-cleaner-thread-6]: Starting 
> (kafka.log.LogCleaner)
> [2019-12-30 19:21:41,079] INFO [kafka-log-cleaner-thread-7]: Starting 
> (kafka.log.LogCleaner)
> [2019-12-30 19:21:42,825] WARN [kafka-log-cleaner-thread-0]: Unexpected 
> exception thrown when cleaning log 
> Log(/home/kafka_2.12-2.1.1/data/kafka-logs/__consumer_offsets-34). Marking 
> its partition (__consumer_offsets-34) as uncleanable (kafka.log.LogCleaner)
> org.apache.kafka.common.record.InvalidRecordException: Found invalid number 
> of record headers -47
> --
> [2019-12-30 20:23:51,340] INFO [kafka-log-cleaner-thread-5]: Starting 
> (kafka.log.LogCleaner)
> [2019-12-30 20:23:51,361] INFO Cleaner 0: Building offset map for log 
> __consumer_offsets-34 for 52 segments in offset range [26842714, 73423730). 
> (kafka.log.LogCleaner)
> [2019-12-30 20:23:51,398] INFO [kafka-log-cleaner-thread-6]: Starting 
> (kafka.log.LogCleaner)
> [2019-12-30 20:23:51,493] INFO [kafka-log-cleaner-thread-7]: Starting 
> (kafka.log.LogCleaner)
> [2019-12-30 20:23:59,596] WARN [kafka-log-cleaner-thread-0]: Unexpected 
> exception thrown when cleaning log 
> Log(/home/kafka_2.12-2.1.1/data/kafka-logs/__consumer_offsets-34). Marking 
> its partition (__consumer_offsets-34) as uncleanable (kafka.log.LogCleaner)
> org.apache.kafka.common.record.InvalidRecordException: Found invalid number 
> of record headers -47
> {code}





[jira] [Created] (KAFKA-9973) __consumer_offsets record is invalid lead to log clean failed and __consumer_offsets grows too big

2020-05-07 Thread leibo (Jira)
leibo created KAFKA-9973:


 Summary: __consumer_offsets record is invalid lead to log clean 
failed and __consumer_offsets grows too big
 Key: KAFKA-9973
 URL: https://issues.apache.org/jira/browse/KAFKA-9973
 Project: Kafka
  Issue Type: Bug
  Components: log cleaner
Affects Versions: 2.1.1
Reporter: leibo


{code:java}
// code placeholder
[2019-12-30 19:21:41,047] INFO [kafka-log-cleaner-thread-6]: Starting 
(kafka.log.LogCleaner)
[2019-12-30 19:21:41,079] INFO [kafka-log-cleaner-thread-7]: Starting 
(kafka.log.LogCleaner)
[2019-12-30 19:21:42,825] WARN [kafka-log-cleaner-thread-0]: Unexpected 
exception thrown when cleaning log 
Log(/home/kafka_2.12-2.1.1/data/kafka-logs/__consumer_offsets-34). Marking its 
partition (__consumer_offsets-34) as uncleanable (kafka.log.LogCleaner)
org.apache.kafka.common.record.InvalidRecordException: Found invalid number of 
record headers -47
--
[2019-12-30 20:23:51,340] INFO [kafka-log-cleaner-thread-5]: Starting 
(kafka.log.LogCleaner)
[2019-12-30 20:23:51,361] INFO Cleaner 0: Building offset map for log 
__consumer_offsets-34 for 52 segments in offset range [26842714, 73423730). 
(kafka.log.LogCleaner)
[2019-12-30 20:23:51,398] INFO [kafka-log-cleaner-thread-6]: Starting 
(kafka.log.LogCleaner)
[2019-12-30 20:23:51,493] INFO [kafka-log-cleaner-thread-7]: Starting 
(kafka.log.LogCleaner)
[2019-12-30 20:23:59,596] WARN [kafka-log-cleaner-thread-0]: Unexpected 
exception thrown when cleaning log 
Log(/home/kafka_2.12-2.1.1/data/kafka-logs/__consumer_offsets-34). Marking its 
partition (__consumer_offsets-34) as uncleanable (kafka.log.LogCleaner)
org.apache.kafka.common.record.InvalidRecordException: Found invalid number of 
record headers -47

{code}





[jira] [Commented] (KAFKA-9813) __consumer_offsets loaded cost very long time

2020-04-08 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077968#comment-17077968
 ] 

leibo commented on KAFKA-9813:
--

[~cricket007] 

It is a production environment and I cannot access it right now.

After I restarted the Kafka cluster, it recovered to normal.

No exception was found in server.log.

Similar to KAFKA-9065.

> __consumer_offsets loaded cost very long time
> -
>
> Key: KAFKA-9813
> URL: https://issues.apache.org/jira/browse/KAFKA-9813
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Major
> Attachments: 2020-04-03_163556.png
>
>
> Loading __consumer_offsets took a very long time.
>  
> After I restarted Kafka (a 3-node HA cluster), one __consumer_offsets partition 
> took 41256909 ms (about 11 hours) to load, and the other __consumer_offsets 
> partitions scheduled to load after __consumer_offsets-14 could not be loaded, so 
> many consumers could not commit offsets or consume.
> restart time: 2020-04-02 19:06
> __consumer_offsets-14 loaded time: 2020-04-03 06:32 (41256909 ms)
> other info:
>     1. there are 72408 consumer groups in the Kafka cluster, and most of them 
> (70498) are empty.
>     2. no exception can be found in server.log
>     3. disk I/O, CPU, and memory load are all very low.
>   
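
(Editorial addition, not from the report.) A quick arithmetic check of the figure quoted above, confirming that 41256909 ms is indeed about 11 hours:
{code:java}
import java.time.Duration;

public class LoadTimeCheck {
    public static void main(String[] args) {
        Duration d = Duration.ofMillis(41_256_909L);
        // prints "11 h 27 min", i.e. the reported "about 11 hours"
        System.out.printf("%d h %d min%n", d.toHours(), d.toMinutes() % 60);
    }
}
{code}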





[jira] [Commented] (KAFKA-9813) __consumer_offsets loaded cost very long time

2020-04-07 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077887#comment-17077887
 ] 

leibo commented on KAFKA-9813:
--

[~cricket007]

log.retention.bytes=104857600

log.segment.bytes=104857600

> __consumer_offsets loaded cost very long time
> -
>
> Key: KAFKA-9813
> URL: https://issues.apache.org/jira/browse/KAFKA-9813
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Major
> Attachments: 2020-04-03_163556.png
>
>
> Loading __consumer_offsets took a very long time.
>  
> After I restarted Kafka (a 3-node HA cluster), one __consumer_offsets partition 
> took 41256909 ms (about 11 hours) to load, and the other __consumer_offsets 
> partitions scheduled to load after __consumer_offsets-14 could not be loaded, so 
> many consumers could not commit offsets or consume.
> restart time: 2020-04-02 19:06
> __consumer_offsets-14 loaded time: 2020-04-03 06:32 (41256909 ms)
> other info:
>     1. there are 72408 consumer groups in the Kafka cluster, and most of them 
> (70498) are empty.
>     2. no exception can be found in server.log
>     3. disk I/O, CPU, and memory load are all very low.
>   
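
(Editorial sketch, not from the report.) The group count mentioned above can be confirmed with the standard AdminClient API; a minimal sketch, assuming a reachable broker at the placeholder address broker1:9092:
{code:java}
import java.util.Collection;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupListing;

public class CountConsumerGroups {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // placeholder bootstrap address; replace with a real broker of the cluster
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            Collection<ConsumerGroupListing> groups = admin.listConsumerGroups().all().get();
            // with ~72k groups even this listing call can take a while
            System.out.println("total consumer groups: " + groups.size());
        }
    }
}
{code}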





[jira] [Updated] (KAFKA-9813) __consumer_offsets loaded cost very long time

2020-04-03 Thread leibo (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-9813:
-
Description: 
Loading __consumer_offsets took a very long time.

 

After I restarted Kafka (a 3-node HA cluster), one __consumer_offsets partition took 
41256909 ms (about 11 hours) to load, and the other __consumer_offsets partitions 
scheduled to load after __consumer_offsets-14 could not be loaded, so many consumers 
could not commit offsets or consume.

restart time: 2020-04-02 19:06

__consumer_offsets-14 loaded time: 2020-04-03 06:32 (41256909 ms)

other info:

    1. there are 72408 consumer groups in the Kafka cluster, and most of them 
(70498) are empty.

    2. no exception can be found in server.log

    3. disk I/O, CPU, and memory load are all very low.
  

  was:
__consumer_offsets loaded with a long time.

 

After I restart kafka(3 HA), one of __consumer_offsets loaded with  41256909 ms 
(about 11 hours),  the others __consumer_offsets partition that load after 
__consumer_offsets-14 can not be loaded , that lead many consumer can not 
commit offsets and consume.

restart time:  2020-04-02 19:06

__consumer_offsets-14 loaded time: 2020-04-03 06:32,  41256909 ms

other info: 

    1.  there are 72408 consumer group in kafka cluster, and most of it (70498) 
are empty.

    2. there was no exception can be found in server.log

   3. the disk i/o and cpu ,memory load are very low.
  


> __consumer_offsets loaded cost very long time
> -
>
> Key: KAFKA-9813
> URL: https://issues.apache.org/jira/browse/KAFKA-9813
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Major
> Attachments: 2020-04-03_163556.png
>
>
> Loading __consumer_offsets took a very long time.
>  
> After I restarted Kafka (a 3-node HA cluster), one __consumer_offsets partition 
> took 41256909 ms (about 11 hours) to load, and the other __consumer_offsets 
> partitions scheduled to load after __consumer_offsets-14 could not be loaded, so 
> many consumers could not commit offsets or consume.
> restart time: 2020-04-02 19:06
> __consumer_offsets-14 loaded time: 2020-04-03 06:32 (41256909 ms)
> other info:
>     1. there are 72408 consumer groups in the Kafka cluster, and most of them 
> (70498) are empty.
>     2. no exception can be found in server.log
>     3. disk I/O, CPU, and memory load are all very low.
>   





[jira] [Updated] (KAFKA-9813) __consumer_offsets loaded cost very long time

2020-04-03 Thread leibo (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-9813:
-
Description: 
Loading __consumer_offsets took a very long time.

 

After I restarted Kafka (a 3-node HA cluster), one __consumer_offsets partition took 
41256909 ms (about 11 hours) to load, and the other __consumer_offsets partitions 
scheduled to load after __consumer_offsets-14 could not be loaded, so many consumers 
could not commit offsets or consume.

restart time: 2020-04-02 19:06

__consumer_offsets-14 loaded time: 2020-04-03 06:32 (41256909 ms)

other info:

    1. there are 72408 consumer groups in the Kafka cluster, and most of them 
(70498) are empty.

    2. no exception can be found in server.log

    3. disk I/O, CPU, and memory load are all very low.
  

  was:
__consumer_offsets loaded with a long time.

 

After I restart kafka(3 HA), one of __consumer_offsets loaded with  41256909 ms 
(about 11 hours), and many consumer can not commit offsets.

restart time:  2020-04-02 19:06

__consumer_offsets-14 loaded time: 2020-04-03 06:32,  41256909 ms

other info: 

    1.  there are 72408 consumer group in kafka cluster, and most of it (70498) 
are empty.

    2. there was no exception can be found in server.log

   3. the disk i/o and cpu ,memory load are very low.
  


> __consumer_offsets loaded cost very long time
> -
>
> Key: KAFKA-9813
> URL: https://issues.apache.org/jira/browse/KAFKA-9813
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Major
> Attachments: 2020-04-03_163556.png
>
>
> Loading __consumer_offsets took a very long time.
>  
> After I restarted Kafka (a 3-node HA cluster), one __consumer_offsets partition 
> took 41256909 ms (about 11 hours) to load, and the other __consumer_offsets 
> partitions scheduled to load after __consumer_offsets-14 could not be loaded, so 
> many consumers could not commit offsets or consume.
> restart time: 2020-04-02 19:06
> __consumer_offsets-14 loaded time: 2020-04-03 06:32 (41256909 ms)
> other info:
>     1. there are 72408 consumer groups in the Kafka cluster, and most of them 
> (70498) are empty.
>     2. no exception can be found in server.log
>     3. disk I/O, CPU, and memory load are all very low.
>   





[jira] [Updated] (KAFKA-9813) __consumer_offsets loaded cost very long time

2020-04-03 Thread leibo (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-9813:
-
Summary: __consumer_offsets loaded cost very long time  (was: 
__consumer_offsets loaded with a very long time)

> __consumer_offsets loaded cost very long time
> -
>
> Key: KAFKA-9813
> URL: https://issues.apache.org/jira/browse/KAFKA-9813
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Major
> Attachments: 2020-04-03_163556.png
>
>
> Loading __consumer_offsets took a very long time.
>  
> After I restarted Kafka (a 3-node HA cluster), one __consumer_offsets partition 
> took 41256909 ms (about 11 hours) to load, and many consumers could not commit offsets.
> restart time: 2020-04-02 19:06
> __consumer_offsets-14 loaded time: 2020-04-03 06:32 (41256909 ms)
> other info:
>     1. there are 72408 consumer groups in the Kafka cluster, and most of them 
> (70498) are empty.
>     2. no exception can be found in server.log
>     3. disk I/O, CPU, and memory load are all very low.
>   





[jira] [Updated] (KAFKA-9813) __consumer_offsets loaded with a very long time

2020-04-03 Thread leibo (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-9813:
-
Description: 
Loading __consumer_offsets took a very long time.

 

After I restarted Kafka (a 3-node HA cluster), one __consumer_offsets partition took 
41256909 ms (about 11 hours) to load, and many consumers could not commit offsets.

restart time: 2020-04-02 19:06

__consumer_offsets-14 loaded time: 2020-04-03 06:32 (41256909 ms)

other info:

    1. there are 72408 consumer groups in the Kafka cluster, and most of them 
(70498) are empty.

    2. no exception can be found in server.log

    3. disk I/O, CPU, and memory load are all very low.
  

  was:
__consumer_offsets loaded with a long time.

 

After I restart kafka(3 HA), one of __consumer_offsets loaded with  41256909 ms 
(about 11 hours), and many consumer can not commit offsets.

restart time:  2020-04-02 19:06

__consumer_offsets-14 loaded time: 2020-04-03 06:32,  41256909 ms

other info: 

    1.  there are 72408 consumer group in kafka cluster, and most of it (70498) 
are empty.

    2. there was no exception can be found in server.log
 


> __consumer_offsets loaded with a very long time
> ---
>
> Key: KAFKA-9813
> URL: https://issues.apache.org/jira/browse/KAFKA-9813
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Major
> Attachments: 2020-04-03_163556.png
>
>
> Loading __consumer_offsets took a very long time.
>  
> After I restarted Kafka (a 3-node HA cluster), one __consumer_offsets partition 
> took 41256909 ms (about 11 hours) to load, and many consumers could not commit offsets.
> restart time: 2020-04-02 19:06
> __consumer_offsets-14 loaded time: 2020-04-03 06:32 (41256909 ms)
> other info:
>     1. there are 72408 consumer groups in the Kafka cluster, and most of them 
> (70498) are empty.
>     2. no exception can be found in server.log
>     3. disk I/O, CPU, and memory load are all very low.
>   





[jira] [Updated] (KAFKA-9813) __consumer_offsets loaded with a very long time

2020-04-03 Thread leibo (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-9813:
-
Attachment: 2020-04-03_163556.png

> __consumer_offsets loaded with a very long time
> ---
>
> Key: KAFKA-9813
> URL: https://issues.apache.org/jira/browse/KAFKA-9813
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Major
> Attachments: 2020-04-03_163556.png
>
>
> Loading __consumer_offsets took a very long time.
>  
> After I restarted Kafka (a 3-node HA cluster), one __consumer_offsets partition 
> took 41256909 ms (about 11 hours) to load, and many consumers could not commit offsets.
> restart time: 2020-04-02 19:06
> __consumer_offsets-14 loaded time: 2020-04-03 06:32 (41256909 ms)
> other info: 
>     there are 72408 consumer groups in the Kafka cluster, and most of them (70498) 
> are empty.





[jira] [Updated] (KAFKA-9813) __consumer_offsets loaded with a very long time

2020-04-03 Thread leibo (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-9813:
-
Description: 
Loading __consumer_offsets took a very long time.

 

After I restarted Kafka (a 3-node HA cluster), one __consumer_offsets partition took 
41256909 ms (about 11 hours) to load, and many consumers could not commit offsets.

restart time: 2020-04-02 19:06

__consumer_offsets-14 loaded time: 2020-04-03 06:32 (41256909 ms)

other info:

    1. there are 72408 consumer groups in the Kafka cluster, and most of them 
(70498) are empty.

    2. no exception can be found in server.log
 

  was:
__consumer_offsets loaded with a long time.

 

After I restart kafka(3 HA), one of __consumer_offsets loaded with  41256909 ms 
(about 11 hours), and many consumer can not commit offsets.

restart time:  2020-04-02 19:06

__consumer_offsets-14 loaded time: 2020-04-03 06:32,  41256909 ms

other info: 

     there are 72408 consumer group in kafka cluster, and most of it (70498) 
are empty.


> __consumer_offsets loaded with a very long time
> ---
>
> Key: KAFKA-9813
> URL: https://issues.apache.org/jira/browse/KAFKA-9813
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Major
> Attachments: 2020-04-03_163556.png
>
>
> Loading __consumer_offsets took a very long time.
>  
> After I restarted Kafka (a 3-node HA cluster), one __consumer_offsets partition 
> took 41256909 ms (about 11 hours) to load, and many consumers could not commit offsets.
> restart time: 2020-04-02 19:06
> __consumer_offsets-14 loaded time: 2020-04-03 06:32 (41256909 ms)
> other info:
>     1. there are 72408 consumer groups in the Kafka cluster, and most of them 
> (70498) are empty.
>     2. no exception can be found in server.log
>  





[jira] [Created] (KAFKA-9813) __consumer_offsets loaded with a very long time

2020-04-03 Thread leibo (Jira)
leibo created KAFKA-9813:


 Summary: __consumer_offsets loaded with a very long time
 Key: KAFKA-9813
 URL: https://issues.apache.org/jira/browse/KAFKA-9813
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 2.1.1
Reporter: leibo


Loading __consumer_offsets took a very long time.

 

After I restarted Kafka (a 3-node HA cluster), one __consumer_offsets partition took 
41256909 ms (about 11 hours) to load, and many consumers could not commit offsets.

restart time: 2020-04-02 19:06

__consumer_offsets-14 loaded time: 2020-04-03 06:32 (41256909 ms)

other info: 

    there are 72408 consumer groups in the Kafka cluster, and most of them (70498) 
are empty.





[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2020-01-21 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020027#comment-17020027
 ] 

leibo edited comment on KAFKA-8532 at 1/22/20 12:56 AM:


[~junrao] I did a further analysis, as below:

The zk client heartbeat thread is org.apache.zookeeper.ClientCnxn$SendThread:
{code:java}
SendThread(ClientCnxnSocket clientCnxnSocket) {
super(ClientCnxn.makeThreadName("-SendThread()"));
ClientCnxn.this.state = States.CONNECTING;//init state
this.clientCnxnSocket = clientCnxnSocket;
this.setDaemon(true);
}
{code}
As we can see, the SendThread's initial state is *CONNECTING*.

And when the zk session has expired, zookeeper.getState() is *CONNECTING*.
{code:java}
// code placeholder
public enum States {
CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY,
CLOSED, AUTH_FAILED, NOT_CONNECTED;

public boolean isAlive() {  // when the zk client state is CONNECTING or 
NOT_CONNECTED the session is clearly not valid, so we should not regard 
ZooKeeper as being alive
return this != CLOSED && this != AUTH_FAILED;
}
public boolean isConnected() {
return this == CONNECTED || this == CONNECTEDREADONLY;
}
}
{code}
According to the code above, when the zk session expires, ClientCnxn.SendThread 
will attempt to reconnect to the zk server and establish a new session. Its initial 
state is CONNECTING, so if the zk-session-expiry thread enters the reinitialize 
method while the zookeeper state is CONNECTING, it will skip the 
*(!connectionState.isAlive)* condition, execute callAfterInitializingSession, and 
trigger the controller-event-thread to process RegisterBrokerAndReelect, even 
though the zk session is not valid yet.
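
(Editorial illustration of the point above, based on the org.apache.zookeeper.ZooKeeper.States enum quoted in this comment.) For the CONNECTING state, isAlive() is already true even though no session is established, which is exactly the gap the reinitialize check falls into:
{code:java}
import org.apache.zookeeper.ZooKeeper;

public class StateGapDemo {
    public static void main(String[] args) {
        ZooKeeper.States s = ZooKeeper.States.CONNECTING;
        // isAlive() only excludes CLOSED and AUTH_FAILED, so a client that is still
        // reconnecting (with no valid session yet) is reported as "alive"
        System.out.println("isAlive = " + s.isAlive());         // true
        System.out.println("isConnected = " + s.isConnected()); // false
    }
}
{code}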


was (Author: lbdai3190):
[~junrao] I do a further analysis as below:

zk client heartbeat thread org.apache.zookeeper.ClientCnxn$SendThread
{code:java}
SendThread(ClientCnxnSocket clientCnxnSocket) {
super(ClientCnxn.makeThreadName("-SendThread()"));
ClientCnxn.this.state = States.CONNECTING;//init state
this.clientCnxnSocket = clientCnxnSocket;
this.setDaemon(true);
}
{code}
As we can see, zk heartbeat thread SendThread init state is *CONNECTING*,
{code:java}
// code placeholder
public enum States {
CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY,
CLOSED, AUTH_FAILED, NOT_CONNECTED;

public boolean isAlive() {  // when zk client state is CONNECTING 
or  NOT_CONNECTED, obviously, zk session is not valid, we can't regard as 
zookeeper is Alive
return this != CLOSED && this != AUTH_FAILED;
}
public boolean isConnected() {
return this == CONNECTED || this == CONNECTEDREADONLY;
}
}
{code}
According to above code, we can see that when zk session expired, 
ClientCnxn.SendThread will attempt to reconnect to zk sever and establish a new 
session.
 It's init state is CONNECTING, if zk-session-expired-thread enter to 
reinitialize method with zookeeper state CONNECTING, it will skip if 
*(!connectionState.isAlive)* condition and execute callAfterInitializingSession 
, trigger controller-event-thread process RegisterBrokerAndReelect,  while zk 
session is invalid now.

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: controllerDeadLockAnalysis-2020-01-20.png, js.log, 
> js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to recover 
> the Kafka cluster is to restart the Kafka servers. The following is the jstack 
> output of the controller-event-thread and zk-session-expiry-handler threads.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:23

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2020-01-21 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020027#comment-17020027
 ] 

leibo edited comment on KAFKA-8532 at 1/21/20 8:51 AM:
---

[~junrao] I did a further analysis, as below:

The zk client heartbeat thread is org.apache.zookeeper.ClientCnxn$SendThread:
{code:java}
SendThread(ClientCnxnSocket clientCnxnSocket) {
super(ClientCnxn.makeThreadName("-SendThread()"));
ClientCnxn.this.state = States.CONNECTING;//init state
this.clientCnxnSocket = clientCnxnSocket;
this.setDaemon(true);
}
{code}
As we can see, the SendThread's initial state is *CONNECTING*.
{code:java}
// code placeholder
public enum States {
CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY,
CLOSED, AUTH_FAILED, NOT_CONNECTED;

public boolean isAlive() {  // when the zk client state is CONNECTING or 
NOT_CONNECTED the session is clearly not valid, so we should not regard 
ZooKeeper as being alive
return this != CLOSED && this != AUTH_FAILED;
}
public boolean isConnected() {
return this == CONNECTED || this == CONNECTEDREADONLY;
}
}
{code}
According to the code above, when the zk session expires, ClientCnxn.SendThread 
will attempt to reconnect to the zk server and establish a new session. Its initial 
state is CONNECTING, so if the zk-session-expiry thread enters the reinitialize 
method while the zookeeper state is CONNECTING, it will skip the 
*(!connectionState.isAlive)* condition, execute callAfterInitializingSession, and 
trigger the controller-event-thread to process RegisterBrokerAndReelect, even 
though the zk session is not valid yet.


was (Author: lbdai3190):
zk client heartbeat thread org.apache.zookeeper.ClientCnxn$SendThread
{code:java}
SendThread(ClientCnxnSocket clientCnxnSocket) {
super(ClientCnxn.makeThreadName("-SendThread()"));
ClientCnxn.this.state = States.CONNECTING;//init state
this.clientCnxnSocket = clientCnxnSocket;
this.setDaemon(true);
}
{code}
As we can see, zk heartbeat thread SendThread init state is *CONNECTING*,
{code:java}
// code placeholder
public enum States {
CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY,
CLOSED, AUTH_FAILED, NOT_CONNECTED;

public boolean isAlive() {  // when zk client state is CONNECTING 
or  NOT_CONNECTED, obviously, zk session is not valid, we can't regard as 
zookeeper is Alive
return this != CLOSED && this != AUTH_FAILED;
}
public boolean isConnected() {
return this == CONNECTED || this == CONNECTEDREADONLY;
}
}
{code}
According to above code, we can found that when zk session expired, 
ClientCnxn.SendThread will attempt to reconnect to zk sever and establish a new 
session.
 It's init state is CONNECTING, if zk-session-expired-thread enter to 
reinitialize method with zookeeper state CONNECTING, it will skip if 
*(!connectionState.isAlive)* condition and execute callAfterInitializingSession 
, trigger controller-event-thread process RegisterBrokerAndReelect,  while zk 
session is invalid now.

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: controllerDeadLockAnalysis-2020-01-20.png, js.log, 
> js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to recover 
> the Kafka cluster is to restart the Kafka servers. The following is the jstack 
> output of the controller-event-thread and zk-session-expiry-handler threads.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the Expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProce

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2020-01-21 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020027#comment-17020027
 ] 

leibo edited comment on KAFKA-8532 at 1/21/20 8:49 AM:
---

The zk client heartbeat thread is org.apache.zookeeper.ClientCnxn$SendThread:
{code:java}
SendThread(ClientCnxnSocket clientCnxnSocket) {
super(ClientCnxn.makeThreadName("-SendThread()"));
ClientCnxn.this.state = States.CONNECTING;//init state
this.clientCnxnSocket = clientCnxnSocket;
this.setDaemon(true);
}
{code}
As we can see, the SendThread's initial state is *CONNECTING*.
{code:java}
// code placeholder
public enum States {
CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY,
CLOSED, AUTH_FAILED, NOT_CONNECTED;

public boolean isAlive() {  // when the zk client state is CONNECTING or 
NOT_CONNECTED the session is clearly not valid, so we should not regard 
ZooKeeper as being alive
return this != CLOSED && this != AUTH_FAILED;
}
public boolean isConnected() {
return this == CONNECTED || this == CONNECTEDREADONLY;
}
}
{code}
According to the code above, when the zk session expires, ClientCnxn.SendThread 
will attempt to reconnect to the zk server and establish a new session. Its initial 
state is CONNECTING, so if the zk-session-expiry thread enters the reinitialize 
method while the zookeeper state is CONNECTING, it will skip the 
*(!connectionState.isAlive)* condition, execute callAfterInitializingSession, and 
trigger the controller-event-thread to process RegisterBrokerAndReelect, even 
though the zk session is not valid yet.


was (Author: lbdai3190):
{code:java}
// code placeholder
public enum States {
CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY,
CLOSED, AUTH_FAILED, NOT_CONNECTED;

public boolean isAlive() {  // when zk client state is CONNECTING 
or  NOT_CONNECTED, obviously, zk session is not valid, we can't regard as 
zookeeper is Alive
return this != CLOSED && this != AUTH_FAILED;
}

/**
 * Returns whether we are connected to a server (which
 * could possibly be read-only, if this client is allowed
 * to go to read-only mode)
 * */
public boolean isConnected() {
return this == CONNECTED || this == CONNECTEDREADONLY;
}
}
{code}

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: controllerDeadLockAnalysis-2020-01-20.png, js.log, 
> js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to recover 
> the Kafka cluster is to restart the Kafka servers. The following is the jstack 
> output of the controller-event-thread and zk-session-expiry-handler threads.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the Expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeep

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2020-01-21 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019851#comment-17019851
 ] 

leibo edited comment on KAFKA-8532 at 1/21/20 8:24 AM:
---

[~junrao] 

I agree with your opinion that "all pending ZK requests should complete with a 
SessionExpired error through the response callback".

See the attachment controllerDeadLockAnalysis-2020-01-20.png I uploaded.

RegisterBrokerAndReelect needs an established zookeeper session in order to be 
processed to completion by the controller-event-thread; Expired needs to be processed 
by the controller-event-thread in order to establish a new zookeeper session; but the 
controller-event-thread is busy handling RegisterBrokerAndReelect.

According to the current analysis, I think there are two ways to deal with this problem:

1. Use a stricter check to prevent the ControllerEvent RegisterBrokerAndReelect from 
being executed when the zk session has expired.

2. Find out why handleRequest blocks when the zookeeper session has expired.
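
(Editorial illustration only, not Kafka code.) A standalone sketch of the wait cycle described above: the two latches stand in for "the Expire event has started processing" and "a new zookeeper session is established", and each thread waits on the latch that only the other thread would count down, so neither can make progress:
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchCycleDemo {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch expireStarted = new CountDownLatch(1);   // "controller started processing Expire"
        CountDownLatch sessionRestored = new CountDownLatch(1); // "new zookeeper session established"

        Thread expiryHandler = new Thread(() -> {
            try {
                expireStarted.await();       // beforeInitializingSession: wait for Expire to start
                sessionRestored.countDown(); // only then would the new session be created
            } catch (InterruptedException ignored) { }
        }, "zk-session-expiry-handler0");

        Thread eventThread = new Thread(() -> {
            try {
                sessionRestored.await();     // busy with RegisterBrokerAndReelect, which needs a live session
                expireStarted.countDown();   // only afterwards would it pick up the Expire event
            } catch (InterruptedException ignored) { }
        }, "controller-event-thread");

        expiryHandler.start();
        eventThread.start();

        // neither latch can ever reach zero, mirroring the jstack output attached to this issue
        boolean progressed = expireStarted.await(2, TimeUnit.SECONDS);
        System.out.println("made progress: " + progressed); // prints "made progress: false"
        expiryHandler.interrupt();
        eventThread.interrupt();
    }
}
{code}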


was (Author: lbdai3190):
[~junrao] 

I'm agree with your opinion "all pending ZK requests should complete with a 
SessionExpired error through the response callback". 

As the attachment controllerDeadLockAnalysis-2020-01-20.png I uploaded.

RegisterBrokerAndReelect need a established zookeeper session to process finish 
by controller-event-thread,Expired need be process by controller-event-thread 
to established zookeeper session, but controller-event-thread is busy to 
handing RegisterBrokerAndReelect.

 

According to current analysis, I think it's tow ways to deal with this problem:

1.  Use more strict judgment conditions to prevent ControllerEvent  
RegisterBrokerAndReelect  to be execute when zk session expired.

2.  Found out the reason why handleRequest blocked when zookeeper session 
Expired. 

 

 

 

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: controllerDeadLockAnalysis-2020-01-20.png, js.log, 
> js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to recover 
> the Kafka cluster is to restart the Kafka servers. The following is the jstack 
> output of the controller-event-thread and zk-session-expiry-handler threads.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the Expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at 

[jira] [Commented] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2020-01-21 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020027#comment-17020027
 ] 

leibo commented on KAFKA-8532:
--

{code:java}
// code placeholder
public enum States {
CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY,
CLOSED, AUTH_FAILED, NOT_CONNECTED;

public boolean isAlive() {  // when the zk client state is CONNECTING or 
NOT_CONNECTED the session is clearly not valid, so we should not regard 
ZooKeeper as being alive
return this != CLOSED && this != AUTH_FAILED;
}

/**
 * Returns whether we are connected to a server (which
 * could possibly be read-only, if this client is allowed
 * to go to read-only mode)
 * */
public boolean isConnected() {
return this == CONNECTED || this == CONNECTEDREADONLY;
}
}
{code}

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: controllerDeadLockAnalysis-2020-01-20.png, js.log, 
> js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to recover 
> the Kafka cluster is to restart the Kafka servers. The following is the jstack 
> output of the controller-event-thread and zk-session-expiry-handler threads.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the Expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
>  - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 
> nid=0x310 waiting on condition [0x7fccb55c8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005d1be5a00> (a 
> java.util.concurrent.CountD

[jira] [Commented] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2020-01-20 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019851#comment-17019851
 ] 

leibo commented on KAFKA-8532:
--

[~junrao] 

I agree with your opinion that "all pending ZK requests should complete with a 
SessionExpired error through the response callback".

See the attachment controllerDeadLockAnalysis-2020-01-20.png I uploaded.

RegisterBrokerAndReelect needs an established zookeeper session in order to be 
processed to completion by the controller-event-thread; Expired needs to be processed 
by the controller-event-thread in order to establish a new zookeeper session; but the 
controller-event-thread is busy handling RegisterBrokerAndReelect.

According to the current analysis, I think there are two ways to deal with this problem:

1. Use a stricter check to prevent the ControllerEvent RegisterBrokerAndReelect from 
being executed when the zk session has expired.

2. Find out why handleRequest blocks when the zookeeper session has expired.

 

 

 

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: controllerDeadLockAnalysis-2020-01-20.png, js.log, 
> js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to recover 
> the Kafka cluster is to restart the Kafka servers. The following is the jstack 
> output of the controller-event-thread and zk-session-expiry-handler threads.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the Expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
>  - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 
> nid=0x310 waiting on condition [0x7fccb55c8000]
>  java.lang.Thread.State:

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2020-01-20 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019294#comment-17019294
 ] 

leibo edited comment on KAFKA-8532 at 1/21/20 1:44 AM:
---

[~junrao], [~yuzhih...@gmail.com]  Thank you for your reply.

1. I have tested throwing an Exception in handleRequest(), and as you said, 
exceptions thrown in the handleRequest method do not block on 
countDownLatch.await().

2. I have rechecked the jstack log in the attachments; my analysis result is in 
the attachment *controllerDeadLockAnalysis-2020-01-20.png*.
{code:java}
// code placeholder
private def reinitialize(): Unit = {
  // Initialization callbacks are invoked outside of the lock to avoid deadlock potential
  // since their completion may require additional Zookeeper requests, which will block to
  // acquire the initialization lock
  stateChangeHandlers.values.foreach(callBeforeInitializingSession _)

  inWriteLock(initializationLock) {
    // if (!connectionState.isAlive) {  // this check may not be accurate
    if (!connectionState.isConnected) { // If isConnected is used to decide whether the
                                        // ZooKeeper session needs to be reinitialized, the
                                        // program will not skip this block and then go on to
                                        // run operations that require ZooKeeper to be connected.
      zooKeeper.close()
      info(s"Initializing a new session to $connectString.")
      // retry forever until ZooKeeper can be instantiated
      var connected = false
      while (!connected) {
        try {
          zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, ZooKeeperClientWatcher)
          connected = true
        } catch {
          case e: Exception =>
            info("Error when recreating ZooKeeper, retrying after a short sleep", e)
            Thread.sleep(1000)
        }
      }
    }
  }

  stateChangeHandlers.values.foreach(callAfterInitializingSession _)
}{code}
Zookeeper Code:
{code:java}
// code placeholder
@InterfaceAudience.Public
public enum States {
CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY,
CLOSED, AUTH_FAILED, NOT_CONNECTED;

public boolean isAlive() {
return this != CLOSED && this != AUTH_FAILED;   
}

/**
 * Returns whether we are connected to a server (which
 * could possibly be read-only, if this client is allowed
 * to go to read-only mode)
 * */
public boolean isConnected() {
return this == CONNECTED || this == CONNECTEDREADONLY;
}
}
{code}
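
For reference, a small check derived from the two methods above (using the public 
org.apache.zookeeper.ZooKeeper.States enum; the snippet is only an illustration, not 
Kafka code) prints the states on which the two guards disagree:
{code:scala}
// Derived from the States enum quoted above: CONNECTING, ASSOCIATING and NOT_CONNECTED
// satisfy isAlive but not isConnected. That is exactly the window where the original
// `!connectionState.isAlive` guard skips reinitialization while the proposed
// `!connectionState.isConnected` guard would perform it.
import org.apache.zookeeper.ZooKeeper.States

object StatePredicates {
  def main(args: Array[String]): Unit =
    States.values().foreach { s =>
      println(f"$s%-20s isAlive=${s.isAlive}%-6s isConnected=${s.isConnected}")
    }
}
{code}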
I hope you can take a look at it.


was (Author: lbdai3190):
[~junrao], [~yuzhih...@gmail.com]  Thank you for your reply.

1. I have tested throwing an Exception in handleRequest(), and as you said, 
exceptions thrown in the handleRequest method do not block on 
countDownLatch.await().

2. I have rechecked the jstack log in the attachments; my analysis result is in 
the attachment controllerDeadLockAnalysis-2020-01-20.png.
{code:java}
// code placeholder
private def reinitialize(): Unit = {
  // Initialization callbacks are invoked outside of the lock to avoid deadlock 
potential since their completion
  // may require additional Zookeeper requests, which will block to acquire the 
initialization lock
  stateChangeHandlers.values.foreach(callBeforeInitializingSession _)

  inWriteLock(initializationLock) {
//  if (!connectionState.isAlive) {   // this check may not be accurate
 if (!connectionState.isConnected) { // If isConnected is used to decide whether the ZooKeeper
                                     // session needs to be reinitialized, the program will not
                                     // skip this block and then go on to run operations that
                                     // require ZooKeeper to be connected.
  zooKeeper.close()
  info(s"Initializing a new session to $connectString.")
  // retry forever until ZooKeeper can be instantiated
  var connected = false
  while (!connected) {
try {
  zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, 
ZooKeeperClientWatcher)
  connected = true
} catch {
  case e: Exception =>
info("Error when recreating ZooKeeper, retrying after a short 
sleep", e)
Thread.sleep(1000)
}
  }
}
  }

  stateChangeHandlers.values.foreach(callAfterInitializingSession _)
}{code}
Zookeeper Code:
{code:java}
// code placeholder
@InterfaceAudience.Public
public enum States {
CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY,
CLOSED, AUTH_FAILED, NOT_CONNECTED;

public boolean isAlive() {
return this != CLOSED && this != AUTH_FAILED;   
}

/**
 * Returns whether we are connected to a server (which
 * could possibly be read-only, if this client is allowed
 * to go to read-only mode)
 * */
public boolean isConnected() {
return this == CONNECTED || this == CONNECTEDREADONLY;
}
}
{code}
I hope you can take a look at it.

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2020-01-20 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019294#comment-17019294
 ] 

leibo edited comment on KAFKA-8532 at 1/21/20 1:44 AM:
---

[~junrao], [~yuzhih...@gmail.com]  Thank you for your reply.

1. I have tested throwing an Exception in handleRequest(), and as you said, 
exceptions thrown in the handleRequest method do not block on 
countDownLatch.await().

2. I have rechecked the jstack log in the attachments; my analysis result is in 
the *attachment* *controllerDeadLockAnalysis-2020-01-20.png*.
{code:java}
// code placeholder
private def reinitialize(): Unit = {
  // Initialization callbacks are invoked outside of the lock to avoid deadlock 
potential since their completion
  // may require additional Zookeeper requests, which will block to acquire the 
initialization lock
  stateChangeHandlers.values.foreach(callBeforeInitializingSession _)

  inWriteLock(initializationLock) {
//  if (!connectionState.isAlive) {   // this check may not be accurate
 if (!connectionState.isConnected) { // If isConnected is used to decide whether the ZooKeeper
                                     // session needs to be reinitialized, the program will not
                                     // skip this block and then go on to run operations that
                                     // require ZooKeeper to be connected.
  zooKeeper.close()
  info(s"Initializing a new session to $connectString.")
  // retry forever until ZooKeeper can be instantiated
  var connected = false
  while (!connected) {
try {
  zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, 
ZooKeeperClientWatcher)
  connected = true
} catch {
  case e: Exception =>
info("Error when recreating ZooKeeper, retrying after a short 
sleep", e)
Thread.sleep(1000)
}
  }
}
  }

  stateChangeHandlers.values.foreach(callAfterInitializingSession _)
}{code}
Zookeeper Code:
{code:java}
// code placeholder
@InterfaceAudience.Public
public enum States {
CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY,
CLOSED, AUTH_FAILED, NOT_CONNECTED;

public boolean isAlive() {
return this != CLOSED && this != AUTH_FAILED;   
}

/**
 * Returns whether we are connected to a server (which
 * could possibly be read-only, if this client is allowed
 * to go to read-only mode)
 * */
public boolean isConnected() {
return this == CONNECTED || this == CONNECTEDREADONLY;
}
}
{code}
I hope you can take a look at it.


was (Author: lbdai3190):
[~junrao], [~yuzhih...@gmail.com]  Thank you for your reply.

1. I have tested throwing an Exception in handleRequest(), and as you said, 
exceptions thrown in the handleRequest method do not block on 
countDownLatch.await().

2. I have rechecked the jstack log in the attachments; my analysis result is in 
the attachment *controllerDeadLockAnalysis-2020-01-20.png*.
{code:java}
// code placeholder
private def reinitialize(): Unit = {
  // Initialization callbacks are invoked outside of the lock to avoid deadlock 
potential since their completion
  // may require additional Zookeeper requests, which will block to acquire the 
initialization lock
  stateChangeHandlers.values.foreach(callBeforeInitializingSession _)

  inWriteLock(initializationLock) {
//  if (!connectionState.isAlive) {   // this check may not be accurate
 if (!connectionState.isConnected) { // If isConnected is used to decide whether the ZooKeeper
                                     // session needs to be reinitialized, the program will not
                                     // skip this block and then go on to run operations that
                                     // require ZooKeeper to be connected.
  zooKeeper.close()
  info(s"Initializing a new session to $connectString.")
  // retry forever until ZooKeeper can be instantiated
  var connected = false
  while (!connected) {
try {
  zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, 
ZooKeeperClientWatcher)
  connected = true
} catch {
  case e: Exception =>
info("Error when recreating ZooKeeper, retrying after a short 
sleep", e)
Thread.sleep(1000)
}
  }
}
  }

  stateChangeHandlers.values.foreach(callAfterInitializingSession _)
}{code}
Zookeeper Code:
{code:java}
// code placeholder
@InterfaceAudience.Public
public enum States {
CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY,
CLOSED, AUTH_FAILED, NOT_CONNECTED;

public boolean isAlive() {
return this != CLOSED && this != AUTH_FAILED;   
}

/**
 * Returns whether we are connected to a server (which
 * could possibly be read-only, if this client is allowed
 * to go to read-only mode)
 * */
public boolean isConnected() {
return this == CONNECTED || this == CONNECTEDREADONLY;
}
}
{code}
I hope you can take a look at it.

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KA

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2020-01-20 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019294#comment-17019294
 ] 

leibo edited comment on KAFKA-8532 at 1/21/20 1:43 AM:
---

[~junrao], [~yuzhih...@gmail.com]  Thank you for your reply.

1. I have tested throwing an Exception in handleRequest(), and as you said, 
exceptions thrown in the handleRequest method do not block on 
countDownLatch.await().

2. I have rechecked the jstack log in the attachments; my analysis result is in 
the attachment controllerDeadLockAnalysis-2020-01-20.png.
{code:java}
// code placeholder
private def reinitialize(): Unit = {
  // Initialization callbacks are invoked outside of the lock to avoid deadlock 
potential since their completion
  // may require additional Zookeeper requests, which will block to acquire the 
initialization lock
  stateChangeHandlers.values.foreach(callBeforeInitializingSession _)

  inWriteLock(initializationLock) {
//  if (!connectionState.isAlive) {   // this check may not be accurate
 if (!connectionState.isConnected) { // If isConnected is used to decide whether the ZooKeeper
                                     // session needs to be reinitialized, the program will not
                                     // skip this block and then go on to run operations that
                                     // require ZooKeeper to be connected.
  zooKeeper.close()
  info(s"Initializing a new session to $connectString.")
  // retry forever until ZooKeeper can be instantiated
  var connected = false
  while (!connected) {
try {
  zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, 
ZooKeeperClientWatcher)
  connected = true
} catch {
  case e: Exception =>
info("Error when recreating ZooKeeper, retrying after a short 
sleep", e)
Thread.sleep(1000)
}
  }
}
  }

  stateChangeHandlers.values.foreach(callAfterInitializingSession _)
}{code}
Zookeeper Code:
{code:java}
// code placeholder
@InterfaceAudience.Public
public enum States {
CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY,
CLOSED, AUTH_FAILED, NOT_CONNECTED;

public boolean isAlive() {
return this != CLOSED && this != AUTH_FAILED;   
}

/**
 * Returns whether we are connected to a server (which
 * could possibly be read-only, if this client is allowed
 * to go to read-only mode)
 * */
public boolean isConnected() {
return this == CONNECTED || this == CONNECTEDREADONLY;
}
}
{code}
I hope you can take a look at it.


was (Author: lbdai3190):
[~junrao], [~yuzhih...@gmail.com]  Thank you for your reply.

1. I have tested throwing an Exception in handleRequest(), and as you said, 
exceptions thrown in the handleRequest method do not block on 
countDownLatch.await().

2. I have rechecked the jstack log in the attachments; my analysis result is in 
the attachment controllerDeadLockAnalysis-2020-01-20.png.
{code:java}
// code placeholder
private def reinitialize(): Unit = {
  // Initialization callbacks are invoked outside of the lock to avoid deadlock 
potential since their completion
  // may require additional Zookeeper requests, which will block to acquire the 
initialization lock
  stateChangeHandlers.values.foreach(callBeforeInitializingSession _)

  inWriteLock(initializationLock) {
//  if (!connectionState.isAlive) {   // this check may not be accurate
 if (!connectionState.isConnected) { // If isConnected is used to decide whether the ZooKeeper
                                     // session needs to be reinitialized, the program will not
                                     // skip this block and then go on to run operations that
                                     // require ZooKeeper to be connected.
  zooKeeper.close()
  info(s"Initializing a new session to $connectString.")
  // retry forever until ZooKeeper can be instantiated
  var connected = false
  while (!connected) {
try {
  zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, 
ZooKeeperClientWatcher)
  connected = true
} catch {
  case e: Exception =>
info("Error when recreating ZooKeeper, retrying after a short 
sleep", e)
Thread.sleep(1000)
}
  }
}
  }

  stateChangeHandlers.values.foreach(callAfterInitializingSession _)
}{code}
Zookeeper Code:
{code:java}
// code placeholder
@InterfaceAudience.Public
public enum States {
CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY,
CLOSED, AUTH_FAILED, NOT_CONNECTED;

public boolean isAlive() {
return this != CLOSED && this != AUTH_FAILED;   // this check is not accurate
}

/**
 * Returns whether we are connected to a server (which
 * could possibly be read-only, if this client is allowed
 * to go to read-only mode)
 * */
public boolean isConnected() {
return this == CONNECTED || this == CONNECTEDREADONLY;
}
}
{code}
I hope you can take a look at it.

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/ji

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2020-01-20 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019294#comment-17019294
 ] 

leibo edited comment on KAFKA-8532 at 1/21/20 1:43 AM:
---

[~junrao], [~yuzhih...@gmail.com]  Thank you for your reply.

1. I have tested throwing an Exception in handleRequest(), and as you said, 
exceptions thrown in the handleRequest method do not block on 
countDownLatch.await().

2. I have rechecked the jstack log in the attachments; my analysis result is in 
the attachment controllerDeadLockAnalysis-2020-01-20.png.
{code:java}
// code placeholder
private def reinitialize(): Unit = {
  // Initialization callbacks are invoked outside of the lock to avoid deadlock 
potential since their completion
  // may require additional Zookeeper requests, which will block to acquire the 
initialization lock
  stateChangeHandlers.values.foreach(callBeforeInitializingSession _)

  inWriteLock(initializationLock) {
//  if (!connectionState.isAlive) {   // this check may not be accurate
 if (!connectionState.isConnected) { // If isConnected is used to decide whether the ZooKeeper
                                     // session needs to be reinitialized, the program will not
                                     // skip this block and then go on to run operations that
                                     // require ZooKeeper to be connected.
  zooKeeper.close()
  info(s"Initializing a new session to $connectString.")
  // retry forever until ZooKeeper can be instantiated
  var connected = false
  while (!connected) {
try {
  zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, 
ZooKeeperClientWatcher)
  connected = true
} catch {
  case e: Exception =>
info("Error when recreating ZooKeeper, retrying after a short 
sleep", e)
Thread.sleep(1000)
}
  }
}
  }

  stateChangeHandlers.values.foreach(callAfterInitializingSession _)
}{code}
Zookeeper Code:
{code:java}
// code placeholder
@InterfaceAudience.Public
public enum States {
CONNECTING, ASSOCIATING, CONNECTED, CONNECTEDREADONLY,
CLOSED, AUTH_FAILED, NOT_CONNECTED;

public boolean isAlive() {
return this != CLOSED && this != AUTH_FAILED;   // this check is not accurate
}

/**
 * Returns whether we are connected to a server (which
 * could possibly be read-only, if this client is allowed
 * to go to read-only mode)
 * */
public boolean isConnected() {
return this == CONNECTED || this == CONNECTEDREADONLY;
}
}
{code}
I hope you can take a look at it.


was (Author: lbdai3190):
[~junrao], [~yuzhih...@gmail.com]  Thank you for your reply.

1. I have tested throwing an Exception in handleRequest(), and as you said, 
exceptions thrown in the handleRequest method do not block on 
countDownLatch.await().

2. I have rechecked the jstack log in the attachments; my analysis result is in 
the attachment controllerDeadLockAnalysis-2020-01-20.png. I hope you can take a 
look at it.

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: controllerDeadLockAnalysis-2020-01-20.png, js.log, 
> js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to recover 
> the Kafka cluster is to restart the Kafka servers. The following is the jstack 
> log of the controller-event-thread and the zk-session-expiry-handler thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the Expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.Z

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2020-01-20 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019294#comment-17019294
 ] 

leibo edited comment on KAFKA-8532 at 1/20/20 8:28 AM:
---

[~junrao], [~yuzhih...@gmail.com]  Thank you for your reply.

1. I have tested throwing an Exception in handleRequest(), and as you said, 
exceptions thrown in the handleRequest method do not block on 
countDownLatch.await().

2. I have rechecked the jstack log in the attachments; my analysis result is in 
the attachment controllerDeadLockAnalysis-2020-01-20.png. I hope you can take a 
look at it.


was (Author: lbdai3190):
[~junrao] Thank you for your reply.

1. I have tested throwing an Exception in handleRequest(), and as you said, 
exceptions thrown in the handleRequest method do not block on 
countDownLatch.await().

2. I have rechecked the jstack log in the attachments; my analysis result is in 
the attachment controllerDeadLockAnalysis-2020-01-20.png. I hope you can take a 
look at it.

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: controllerDeadLockAnalysis-2020-01-20.png, js.log, 
> js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to recover 
> the Kafka cluster is to restart the Kafka servers. The following is the jstack 
> log of the controller-event-thread and the zk-session-expiry-handler thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the Expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
>  - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 
> nid=0x310 waiting on condition [0x7fccb55c8000]

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2020-01-20 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019294#comment-17019294
 ] 

leibo edited comment on KAFKA-8532 at 1/20/20 8:26 AM:
---

[~junrao] Thank you for your reply.

1. I have tested throwing an Exception in handleRequest(), and as you said, 
exceptions thrown in the handleRequest method do not block on 
countDownLatch.await().

2. I have rechecked the jstack log in the attachments; my analysis result is in 
the attachment controllerDeadLockAnalysis-2020-01-20.png. I hope you can take a 
look at it.


was (Author: lbdai3190):
[~junrao] Thank you for your reply.

1. I have tested throwing an Exception in handleRequest(), and as you said, 
exceptions thrown in handleRequest do not block on countDownLatch.await().

2. I have rechecked the jstack log in the attachments; my analysis result is in 
the attachment controllerDeadLockAnalysis-2020-01-20.png.

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: controllerDeadLockAnalysis-2020-01-20.png, js.log, 
> js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to recover 
> the Kafka cluster is to restart the Kafka servers. The following is the jstack 
> log of the controller-event-thread and the zk-session-expiry-handler thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the Expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
>  - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 
> nid=0x310 waiting on condition [0x7fccb55c8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2020-01-20 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019294#comment-17019294
 ] 

leibo edited comment on KAFKA-8532 at 1/20/20 8:21 AM:
---

[~junrao] Thank you for your reply.

1. I have tested throwing an Exception in handleRequest(), and as you said, 
exceptions thrown in handleRequest do not block on countDownLatch.await().

2. I have rechecked the jstack log in the attachments; my analysis result is in 
the attachment controllerDeadLockAnalysis-2020-01-20.png.


was (Author: lbdai3190):
[~junrao] Thank you for your reply.

1. I have tested throwing an Exception in handleRequest(), and as you said, 
exceptions thrown in handleRequest do not block on countDownLatch.await().

2. I have rechecked the jstack log in the attachments; my analysis result is in 
the attachment controllerDeadLockAnalysis2020-01-20.png.

!controllerDeadLockAnalysis-2020-01-20.png!

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: controllerDeadLockAnalysis-2020-01-20.png, js.log, 
> js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to recover 
> the Kafka cluster is to restart the Kafka servers. The following is the jstack 
> log of the controller-event-thread and the zk-session-expiry-handler thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the Expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
>  - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 
> nid=0x310 waiting on condition [0x7fccb55c8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Uns

[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2020-01-20 Thread leibo (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-8532:
-
Attachment: controllerDeadLockAnalysis-2020-01-20.png

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: controllerDeadLockAnalysis-2020-01-20.png, js.log, 
> js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to recover 
> the Kafka cluster is to restart the Kafka servers. The following is the jstack 
> log of the controller-event-thread and the zk-session-expiry-handler thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the Expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
>  - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 
> nid=0x310 waiting on condition [0x7fccb55c8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005d1be5a00> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>  at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157)
>  at 
> kafka.zk.KafkaZkClient.retryReq

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2020-01-20 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019294#comment-17019294
 ] 

leibo edited comment on KAFKA-8532 at 1/20/20 8:20 AM:
---

[~junrao] Thank you for your reply.

1. I have tested throwing an Exception in handleRequest(), and as you said, 
exceptions thrown in handleRequest do not block on countDownLatch.await().

2. I have rechecked the jstack log in the attachments; my analysis result is in 
the attachment controllerDeadLockAnalysis2020-01-20.png.

!controllerDeadLockAnalysis-2020-01-20.png!


was (Author: lbdai3190):
[~junrao] Thank you for your reply.

1. I have tested throwing an Exception in handleRequest(), and as you said, 
exceptions thrown in handleRequest do not block on countDownLatch.await().

2. I have rechecked the jstack log in the attachments; my analysis result is in 
the attachment controllerDeadLockAnalysis2020-01-20.png.
!controllerDeadLockAnalysis2020-01-20.png!

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: controllerDeadLockAnalysis-2020-01-20.png, js.log, 
> js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to recover 
> the Kafka cluster is to restart the Kafka servers. The following is the jstack 
> log of the controller-event-thread and the zk-session-expiry-handler thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the Expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
>  - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 
> nid=0x310 waiting on condition [0x7fccb55c8000]
>  java.lang.Thread.S

[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2020-01-20 Thread leibo (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-8532:
-
Attachment: (was: controllerDeadLockAnalysis2020-01-20.png)

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log, js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to recover 
> the Kafka cluster is to restart the Kafka servers. The following is the jstack 
> log of the controller-event-thread and the zk-session-expiry-handler thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the Expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
>  - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 
> nid=0x310 waiting on condition [0x7fccb55c8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005d1be5a00> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>  at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157)
>  at 
> kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.sc

[jira] [Commented] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2020-01-20 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019294#comment-17019294
 ] 

leibo commented on KAFKA-8532:
--

[~junrao] Thank you for your reply.

1. I have tested throwing an Exception in handleRequest(), and as you said, 
exceptions thrown in handleRequest do not block on countDownLatch.await().

2. I have rechecked the jstack log in the attachments; my analysis result is in 
the attachment controllerDeadLockAnalysis2020-01-20.png.
!controllerDeadLockAnalysis2020-01-20.png!

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: controllerDeadLockAnalysis2020-01-20.png, js.log, 
> js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to recover 
> the Kafka cluster is to restart the Kafka servers. The following is the jstack 
> log of the controller-event-thread and the zk-session-expiry-handler thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the Expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
>  - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 
> nid=0x310 waiting on condition [0x7fccb55c8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005d1be5a00> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterrupti

[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2020-01-20 Thread leibo (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-8532:
-
Attachment: controllerDeadLockAnalysis2020-01-20.png

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: controllerDeadLockAnalysis2020-01-20.png, js.log, 
> js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to recover 
> the Kafka cluster is to restart the Kafka servers. The following is the jstack 
> log of the controller-event-thread and the zk-session-expiry-handler thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the Expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
>  - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 
> nid=0x310 waiting on condition [0x7fccb55c8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005d1be5a00> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>  at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157)
>  at 
> kafka.zk.KafkaZkClient.retryReque

[jira] [Commented] (KAFKA-6879) Controller deadlock following session expiration

2020-01-17 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-6879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017827#comment-17017827
 ] 

leibo commented on KAFKA-6879:
--

Hello [~hachikuji], I have hit this issue on Kafka 2.1.1 many times, and the 
description is here: https://issues.apache.org/jira/browse/KAFKA-8532

 

So I think the controller deadlock problem has not been completely solved.

> Controller deadlock following session expiration
> 
>
> Key: KAFKA-6879
> URL: https://issues.apache.org/jira/browse/KAFKA-6879
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 1.1.0
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Critical
> Fix For: 1.1.1, 2.0.0
>
>
> We have observed an apparent deadlock situation which occurs following a 
> session expiration. The suspected deadlock occurs between the zookeeper 
> "initializationLock" and the latch inside the Expire event which we use to 
> ensure all events have been handled.
> In the logs, we see the "Session expired" message following acquisition of 
> the initialization lock: 
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/zookeeper/ZooKeeperClient.scala#L358
> But we never see any logs indicating that the new session is being 
> initialized. In fact, the controller logs are basically empty from that point 
> on. The problem we suspect is that completion of the 
> {{beforeInitializingSession}} callback requires that all events have finished 
> processing in order to count down the latch: 
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/controller/KafkaController.scala#L1525.
> But an event which was dequeued just prior to the acquisition of the write 
> lock may be unable to complete because it is awaiting acquisition of the 
> initialization lock: 
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/zookeeper/ZooKeeperClient.scala#L137.
> The impact is that the broker continues in a zombie state. It continues 
> fetching and is periodically added to ISRs, but it never receives any further 
> requests from the controller since it is not registered.
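
The lock-versus-latch cycle described above can be reduced to a small, self-contained 
sketch (an editor's illustration with hypothetical names such as eventsDrained; it is 
not the actual Kafka code): one thread holds the write side of an initialization-style 
read/write lock and waits for in-flight events to drain, while the event thread needs 
the read side of that same lock to finish the event it already dequeued.
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class InitializationLockDeadlockSketch {
    static final ReentrantReadWriteLock initializationLock = new ReentrantReadWriteLock();
    // Counted down only when the event thread finishes the event it dequeued.
    static final CountDownLatch eventsDrained = new CountDownLatch(1);

    public static void main(String[] args) throws InterruptedException {
        Thread eventThread = new Thread(() -> {
            // An event dequeued just before expiry handling: it needs the read
            // lock to send its ZooKeeper request before it can complete.
            initializationLock.readLock().lock();
            try {
                eventsDrained.countDown();
            } finally {
                initializationLock.readLock().unlock();
            }
        }, "controller-event-thread");

        // Expiry path: grab the write lock, then wait for pending events to
        // drain before re-initializing the session.
        initializationLock.writeLock().lock();
        eventThread.start();
        boolean drained = eventsDrained.await(1, TimeUnit.SECONDS);
        System.out.println("events drained before re-init? " + drained); // false: circular wait
        initializationLock.writeLock().unlock(); // with an untimed await() we would never get here
        eventThread.join();
    }
}
{code}
With an untimed await() both threads would stay parked indefinitely, which matches the 
silent controller logs described above.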



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2020-01-16 Thread leibo (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-8532:
-
Description: 
We have observed a serious deadlock between the controller-event-thread and the 
zk-session-expiry-handler thread. When this issue occurs, the only way to 
recover the Kafka cluster is to restart the Kafka servers. The following is the 
jstack output of the controller-event-thread and zk-session-expiry-handler threads.

"zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
 java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for <0x0005ee3f7000> (a 
java.util.concurrent.CountDownLatch$Sync)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
 at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
waiting for the controller-event-thread to process the Expire event
 at 
kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
 at 
kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
 at 
kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
 at 
kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
 at 
kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
 at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
Source)
 at scala.collection.Iterator.foreach(Iterator.scala:937)
 at scala.collection.Iterator.foreach$(Iterator.scala:937)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
 at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
 at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
 at 
kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
 at kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
Source)
 at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
 at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
Source)
 at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)

Locked ownable synchronizers:
 - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 nid=0x310 
waiting on condition [0x7fccb55c8000]
 java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for <0x0005d1be5a00> (a 
java.util.concurrent.CountDownLatch$Sync)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
 at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
 at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157)
 at kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1596)
 at 
kafka.zk.KafkaZkClient.kafka$zk$KafkaZkClient$$retryRequestUntilConnected(KafkaZkClient.scala:1589)
 at 
kafka.zk.KafkaZkClient.deletePreferredReplicaElection(KafkaZkClient.scala:989)
 at 
kafka.controller.KafkaController.removePartitionsFromPreferredReplicaElection(KafkaController.scala:873)
 at 
kafka.controller.KafkaController.kafka$controller$KafkaController$$onPreferredReplicaElection(KafkaController.scala:631)
 at 
kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:266)
 at 
kafka.controller.KafkaController.kafka$controller$KafkaController$$elect(KafkaController.scala:1221)
 at 
kafka.controller.KafkaController$Reelect$.pr

[jira] [Updated] (KAFKA-9366) please consider upgrade log4j to log4j2 due to critical security problem CVE-2019-17571

2020-01-05 Thread leibo (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-9366:
-
Description: 
h2. CVE-2019-17571 Detail

Included in Log4j 1.2 is a SocketServer class that is vulnerable to 
deserialization of untrusted data which can be exploited to remotely execute 
arbitrary code when combined with a deserialization gadget when listening to 
untrusted network traffic for log data. This affects Log4j versions up to and 
including 1.2.17.

 

[https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-17571]

 

  was:
h2. CVE-2019-17571 Detail

Included in Log4j 1.2 is a SocketServer class that is vulnerable to 
deserialization of untrusted data which can be exploited to remotely execute 
arbitrary code when combined with a deserialization gadget when listening to 
untrusted network traffic for log data. This affects Log4j versions up to and 
including 1.2.17.

 


> please consider upgrade log4j to log4j2 due to critical security problem 
> CVE-2019-17571
> ---
>
> Key: KAFKA-9366
> URL: https://issues.apache.org/jira/browse/KAFKA-9366
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.2.0, 2.1.1, 2.3.0, 2.4.0
>Reporter: leibo
>Priority: Critical
>
> h2. CVE-2019-17571 Detail
> Included in Log4j 1.2 is a SocketServer class that is vulnerable to 
> deserialization of untrusted data which can be exploited to remotely execute 
> arbitrary code when combined with a deserialization gadget when listening to 
> untrusted network traffic for log data. This affects Log4j versions up to and 
> including 1.2.17.
>  
> [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-17571]
>  
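
For anyone checking whether a given deployment even ships the affected class, a simple 
classpath probe like the following can help (an editor's sketch; Log4jSocketServerProbe 
is a hypothetical name, but org.apache.log4j.net.SocketServer is the Log4j 1.x class 
named by the CVE):
{code:java}
public class Log4jSocketServerProbe {
    public static void main(String[] args) {
        try {
            // Present only when Log4j 1.x is on the classpath.
            Class.forName("org.apache.log4j.net.SocketServer");
            System.err.println("Log4j 1.x SocketServer is on the classpath; "
                    + "exposed instances are affected by CVE-2019-17571.");
        } catch (ClassNotFoundException e) {
            System.out.println("Log4j 1.x SocketServer not found on the classpath.");
        }
    }
}
{code}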



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-9366) please consider upgrade log4j to log4j2 due to critical security problem CVE-2019-17571

2020-01-05 Thread leibo (Jira)
leibo created KAFKA-9366:


 Summary: please consider upgrade log4j to log4j2 due to critical 
security problem CVE-2019-17571
 Key: KAFKA-9366
 URL: https://issues.apache.org/jira/browse/KAFKA-9366
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 2.4.0, 2.3.0, 2.1.1, 2.2.0
Reporter: leibo


h2. CVE-2019-17571 Detail

Included in Log4j 1.2 is a SocketServer class that is vulnerable to 
deserialization of untrusted data which can be exploited to remotely execute 
arbitrary code when combined with a deserialization gadget when listening to 
untrusted network traffic for log data. This affects Log4j versions up to and 
including 1.2.17.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-09-18 Thread leibo (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-8532:
-
Affects Version/s: (was: 2.3.0)

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log, js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to 
> recover the Kafka cluster is to restart the Kafka servers. The following is the 
> jstack output of the controller-event-thread and zk-session-expiry-handler threads.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the Expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
>  - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 
> nid=0x310 waiting on condition [0x7fccb55c8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005d1be5a00> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>  at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157)
>  at 
> kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1596)
>  at 
> kafka.zk.

[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-09-17 Thread leibo (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-8532:
-
Affects Version/s: 2.3.0

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1, 2.3.0
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log, js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to 
> recover the Kafka cluster is to restart the Kafka servers. The following is the 
> jstack output of the controller-event-thread and zk-session-expiry-handler threads.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the Expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
>  - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 
> nid=0x310 waiting on condition [0x7fccb55c8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005d1be5a00> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>  at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157)
>  at 
> kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1596)
>  at 
> kafka.zk.Kafk

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-09-17 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932042#comment-16932042
 ] 

leibo edited comment on KAFKA-8532 at 9/18/19 4:27 AM:
---

[~junrao], [~ijuma]

My analysis:

      When Kafka is disconnected from ZooKeeper, the controller-event-thread is 
handling an ISRChangeNotification event; because Kafka is disconnected from ZK, 
controller event handling is blocked in retryRequestsUntilConnected -> 
handleRequests.

     At this time, the zk-session-expiry-handler thread tries to reinitialize a 
new ZooKeeper connection: it puts a controller event named Expire into the 
ControllerEventManager LinkedBlockingQueue, but because the 
controller-event-thread is blocked on the ISRChangeNotification event, the Expire 
event will never be handled by the ControllerEventThread.

    The problem is obvious: ISRChangeNotification and Expire block each 
other.

The problem is here:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}
{code}
We can see that if an exception occurs, countDownLatch.countDown() will not be 
executed, and this method will remain blocked here.
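
The mutual wait described in this comment can be modelled with two latches and two 
threads (an editor's sketch; zkResponseLatch and expireProcessingStarted are 
hypothetical names, and this is not the actual Kafka code): the event thread cannot 
finish its current event until a ZooKeeper response arrives, and the expiry handler 
will not reinitialize ZooKeeper, which is what would let responses flow again, until 
the event thread starts processing Expire.
{code:java}
import java.util.concurrent.CountDownLatch;

public class ControllerExpireDeadlockSketch {
    // Released only when a ZooKeeper response arrives for the in-flight request.
    static final CountDownLatch zkResponseLatch = new CountDownLatch(1);
    // Released only when the controller-event-thread starts processing Expire.
    static final CountDownLatch expireProcessingStarted = new CountDownLatch(1);

    public static void main(String[] args) throws InterruptedException {
        Thread controllerEventThread = new Thread(() -> {
            try {
                zkResponseLatch.await();             // stuck in ISRChangeNotification
                expireProcessingStarted.countDown(); // would only happen afterwards
            } catch (InterruptedException ignored) { }
        }, "controller-event-thread");

        Thread sessionExpiryHandler = new Thread(() -> {
            try {
                expireProcessingStarted.await();     // beforeInitializingSession waits here
                zkResponseLatch.countDown();         // a reinitialized session would answer again
            } catch (InterruptedException ignored) { }
        }, "zk-session-expiry-handler0");

        controllerEventThread.start();
        sessionExpiryHandler.start();
        controllerEventThread.join(2000);
        sessionExpiryHandler.join(2000);
        System.out.println("controller-event-thread parked: " + controllerEventThread.isAlive());
        System.out.println("zk-session-expiry-handler0 parked: " + sessionExpiryHandler.isAlive());
        System.exit(0); // both threads stay parked in CountDownLatch.await(), as in the jstack
    }
}
{code}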


was (Author: lbdai3190):
[~junrao], [~ijuma]

My analysis:

      When Kafka is disconnected from ZooKeeper, the controller-event-thread is 
handling an ISRChangeNotification event; because Kafka is disconnected from ZK, 
controller event handling is blocked in retryRequestsUntilConnected -> 
handleRequests.

     At this time, the zk-session-expiry-handler thread tries to reinitialize a 
new ZooKeeper connection: it puts a controller event named Expire into the 
ControllerEventManager LinkedBlockingQueue, but because the 
controller-event-thread is blocked on the ISRChangeNotification event, the Expire 
event will never be handled.

    The problem is obvious: ISRChangeNotification and Expire block each 
other.

The problem is here:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}
{code}
We can see that if an exception occurs, countDownLatch.countDown() will not be 
executed, and this method will remain blocked here.

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log, js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to 
> recover the Kafka cluster is to restart the Kafka servers. The following is the 
> jstack output of the controller-event-thread and zk-session-expiry-handler threads.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-09-17 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932042#comment-16932042
 ] 

leibo edited comment on KAFKA-8532 at 9/18/19 4:26 AM:
---

[~junrao], [~ijuma]

My analysis:

      When Kafka is disconnected from ZooKeeper, the controller-event-thread is 
handling an ISRChangeNotification event; because Kafka is disconnected from ZK, 
controller event handling is blocked in retryRequestsUntilConnected -> 
handleRequests.

     At this time, the zk-session-expiry-handler thread tries to reinitialize a 
new ZooKeeper connection: it puts a controller event named Expire into the 
ControllerEventManager LinkedBlockingQueue, but because the 
controller-event-thread is blocked on the ISRChangeNotification event, the Expire 
event will never be handled.

    The problem is obvious: ISRChangeNotification and Expire block each 
other.

The problem is here:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}
{code}
We can see that if an exception occurs, countDownLatch.countDown() will not be 
executed, and this method will remain blocked here.


was (Author: lbdai3190):
[~junrao], [~ijuma]

My analysis:

      When Kafka is disconnected from ZooKeeper, the controller-event-thread is 
handling an ISRChangeNotification event; because Kafka is disconnected from ZK, 
controller event handling is blocked in retryRequestsUntilConnected -> 
handleRequests.

     At this time, the zk-session-expiry-handler thread tries to reinitialize a 
new ZooKeeper connection: it puts a controller event named Expire into the 
ControllerEventManager LinkedBlockingQueue, but because the 
controller-event-thread is blocked on the ISRChangeNotification event, the Expire 
event will never be handled.

    The problem is obvious: ISRChangeNotification and Expire block each 
other.

The problem is here:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}
{code}
We can see that if an exception occurs, countDownLatch.countDown() will not be 
executed, and this method will remain blocked here.

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log, js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to 
> recover the Kafka cluster is to restart the Kafka servers. The following is the 
> jstack output of the controller-event-thread and zk-session-expiry-handler threads.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.conc

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-09-17 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932042#comment-16932042
 ] 

leibo edited comment on KAFKA-8532 at 9/18/19 4:26 AM:
---

[~junrao], [~ijuma]

My analysis:

      When Kafka is disconnected from ZooKeeper, the controller-event-thread is 
handling an ISRChangeNotification event; because Kafka is disconnected from ZK, 
controller event handling is blocked in retryRequestsUntilConnected -> 
handleRequests.

     At this time, the zk-session-expiry-handler thread tries to reinitialize a 
new ZooKeeper connection: it puts a controller event named Expire into the 
ControllerEventManager LinkedBlockingQueue, but because the 
controller-event-thread is blocked on the ISRChangeNotification event, the Expire 
event will never be handled.

    The problem is obvious: ISRChangeNotification and Expire block each 
other.

The problem is here:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}
{code}
We can see that if an exception occurs, countDownLatch.countDown() will not be 
executed, and this method will remain blocked here.


was (Author: lbdai3190):
[~junrao], [~ijuma]

My analysis:

      When Kafka is disconnected from ZooKeeper, the controller-event-thread is 
handling an ISRChangeNotification event; because Kafka is disconnected from ZK, 
controller event handling is blocked in retryRequestsUntilConnected -> 
handleRequests.

     At this time, the zk-session-expiry-handler thread tries to reinitialize a 
new ZooKeeper connection: it puts a controller event named Expire into the 
ControllerEventManager LinkedBlockingQueue, but because the 
controller-event-thread is blocked on the ISRChangeNotification event, the Expire 
event will never be handled.

    The problem is obvious: ISRChangeNotification and Expire block each 
other.

The problem is here:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}
{code}
We can see that if an exception occurs, countDownLatch.countDown() will not be 
executed, and this method will remain blocked here.

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log, js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to 
> recover the Kafka cluster is to restart the Kafka servers. The following is the 
> jstack output of the controller-event-thread and zk-session-expiry-handler threads.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.con

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-09-17 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932042#comment-16932042
 ] 

leibo edited comment on KAFKA-8532 at 9/18/19 4:24 AM:
---

[~junrao]

My analysis:

      When Kafka is disconnected from ZooKeeper, the controller-event-thread is 
handling an ISRChangeNotification event; because Kafka is disconnected from ZK, 
controller event handling is blocked in retryRequestsUntilConnected -> 
handleRequests.

     At this time, the zk-session-expiry-handler thread tries to reinitialize a 
new ZooKeeper connection: it puts a controller event named Expire into the 
ControllerEventManager LinkedBlockingQueue, but because the 
controller-event-thread is blocked on the ISRChangeNotification event, the Expire 
event will never be handled.

    The problem is obvious: ISRChangeNotification and Expire block each 
other.

The problem is here:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}
{code}
We can see that if an exception occurs, countDownLatch.countDown() will not be 
executed, and this method will remain blocked here.


was (Author: lbdai3190):
My analysis:

      When Kafka is disconnected from ZooKeeper, the controller-event-thread is 
handling an ISRChangeNotification event; because Kafka is disconnected from ZK, 
controller event handling is blocked in retryRequestsUntilConnected -> 
handleRequests.

     At this time, the zk-session-expiry-handler thread tries to reinitialize a 
new ZooKeeper connection: it puts a controller event named Expire into the 
ControllerEventManager LinkedBlockingQueue, but because the 
controller-event-thread is blocked on the ISRChangeNotification event, the Expire 
event will never be handled.

    The problem is obvious: ISRChangeNotification and Expire block each 
other.

The problem is here:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}
{code}
We can see that if an exception occurs, countDownLatch.countDown() will not be 
executed, and this method will remain blocked here.

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log, js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to 
> recover the Kafka cluster is to restart the Kafka servers. The following is the 
> jstack output of the controller-event-thread and zk-session-expiry-handler threads.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSyn

[jira] [Commented] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-09-17 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932042#comment-16932042
 ] 

leibo commented on KAFKA-8532:
--

My analysis:

      When Kafka is disconnected from ZooKeeper, the controller-event-thread is 
handling an ISRChangeNotification event; because Kafka is disconnected from ZK, 
controller event handling is blocked in retryRequestsUntilConnected -> 
handleRequests.

     At this time, the zk-session-expiry-handler thread tries to reinitialize a 
new ZooKeeper connection: it puts a controller event named Expire into the 
ControllerEventManager LinkedBlockingQueue, but because the 
controller-event-thread is blocked on the ISRChangeNotification event, the Expire 
event will never be handled.

    The problem is obvious: ISRChangeNotification and Expire block each 
other.

The problem is here:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}
{code}
We can see that if an exception occurs, countDownLatch.countDown() will not be 
executed, and this method will remain blocked here.
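
The kafka.zookeeper.ZooKeeperClient.handleRequests frame in the jstack corresponds to 
the countDownLatch.await() in the code above. The hang pattern can be reproduced with a 
toy asynchronous client whose response callback never fires (an editor's sketch; 
Connection and handleRequest are hypothetical names, not Kafka's API):
{code:java}
import java.util.concurrent.CountDownLatch;

public class AsyncRequestLatchSketch {
    interface Connection {
        void send(String request, Runnable onResponse); // onResponse fires only when a reply arrives
    }

    static void handleRequest(Connection conn, String request) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        conn.send(request, latch::countDown);
        latch.await(); // parks forever if the reply, and hence countDown(), never happens
    }

    public static void main(String[] args) throws InterruptedException {
        // A "disconnected" connection: the request is swallowed, so the
        // response callback is never invoked.
        Connection disconnected = (request, onResponse) -> { /* reply lost */ };

        Thread eventThread = new Thread(() -> {
            try {
                handleRequest(disconnected, "ISRChangeNotification");
            } catch (InterruptedException ignored) { }
        }, "controller-event-thread");
        eventThread.start();
        eventThread.join(2000);
        System.out.println("controller-event-thread still parked: " + eventThread.isAlive());
        System.exit(0); // the sketch never unblocks on its own
    }
}
{code}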

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log, js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to 
> recover the Kafka cluster is to restart the Kafka servers. The following is the 
> jstack output of the controller-event-thread and zk-session-expiry-handler threads.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the Expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(Fut

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-09-17 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932042#comment-16932042
 ] 

leibo edited comment on KAFKA-8532 at 9/18/19 4:25 AM:
---

[~junrao], [~ijuma]

My analysis:

      When Kafka is disconnected from ZooKeeper, the controller-event-thread is 
handling an ISRChangeNotification event; because Kafka is disconnected from ZK, 
controller event handling is blocked in retryRequestsUntilConnected -> 
handleRequests.

     At this time, the zk-session-expiry-handler thread tries to reinitialize a 
new ZooKeeper connection: it puts a controller event named Expire into the 
ControllerEventManager LinkedBlockingQueue, but because the 
controller-event-thread is blocked on the ISRChangeNotification event, the Expire 
event will never be handled.

    The problem is obvious: ISRChangeNotification and Expire block each 
other.

The problem is here:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}
{code}
We can see that if an exception occurs, countDownLatch.countDown() will not be 
executed, and this method will remain blocked here.


was (Author: lbdai3190):
[~junrao]

My analysis:

      When Kafka is disconnected from ZooKeeper, the controller-event-thread is 
handling an ISRChangeNotification event; because Kafka is disconnected from ZK, 
controller event handling is blocked in retryRequestsUntilConnected -> 
handleRequests.

     At this time, the zk-session-expiry-handler thread tries to reinitialize a 
new ZooKeeper connection: it puts a controller event named Expire into the 
ControllerEventManager LinkedBlockingQueue, but because the 
controller-event-thread is blocked on the ISRChangeNotification event, the Expire 
event will never be handled.

    The problem is obvious: ISRChangeNotification and Expire block each 
other.

The problem is here:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}
{code}
We can see that if an exception occurs, countDownLatch.countDown() will not be 
executed, and this method will remain blocked here.

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log, js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to 
> recover the Kafka cluster is to restart the Kafka servers. The following is the 
> jstack output of the controller-event-thread and zk-session-expiry-handler threads.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.lo

[jira] [Commented] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-09-17 Thread leibo (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932006#comment-16932006
 ] 

leibo commented on KAFKA-8532:
--

Hi [~junrao], the problem has happened again; the cluster jstack dump logs 
have been uploaded.

Please help analyze this problem, thanks.

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log, js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to 
> recover the Kafka cluster is to restart the Kafka servers. The following is the 
> jstack output of the controller-event-thread and zk-session-expiry-handler threads.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the Expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
>  - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 
> nid=0x310 waiting on condition [0x7fccb55c8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005d1be5a00> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>  at kafka.zookeeper.ZooKee

[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-09-17 Thread leibo (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-8532:
-
Attachment: js2.log

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log, js0.log, js1.log, js2.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to 
> recover the Kafka cluster is to restart the Kafka servers. The following is the 
> jstack output of the controller-event-thread and zk-session-expiry-handler threads.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the Expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
>  - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 
> nid=0x310 waiting on condition [0x7fccb55c8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005d1be5a00> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>  at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157)
>  at 
> kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1596)
>  at 
> kafka.zk.KafkaZkClient.ka

[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-09-17 Thread leibo (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-8532:
-
Attachment: js0.log

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log, js0.log, js1.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler0 thread. When this issue occurs, the only way to 
> recover the Kafka cluster is to restart the Kafka server. The following is the 
> jstack log of the controller-event-thread and zk-session-expiry-handler0 thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
>  - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 
> nid=0x310 waiting on condition [0x7fccb55c8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005d1be5a00> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>  at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157)
>  at 
> kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1596)
>  at 
> kafka.zk.KafkaZkClient.kafka$zk$Ka

[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-09-17 Thread leibo (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-8532:
-
Attachment: js1.log

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log, js0.log, js1.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler0 thread. When this issue occurs, the only way to 
> recover the Kafka cluster is to restart the Kafka server. The following is the 
> jstack log of the controller-event-thread and zk-session-expiry-handler0 thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
>  - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 
> nid=0x310 waiting on condition [0x7fccb55c8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005d1be5a00> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>  at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157)
>  at 
> kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1596)
>  at 
> kafka.zk.KafkaZkClient.kafka$zk$Ka

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-06-17 Thread leibo (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865266#comment-16865266
 ] 

leibo edited comment on KAFKA-8532 at 6/17/19 11:15 AM:


[~ijuma] I have tried to reproduce this but failed; it reappears a few days after I 
restart the Kafka cluster.

The following info may be helpful:
 # We are running the Kafka cluster in Docker containers managed by Kubernetes.
 # Until now, every time this issue occurred, we found that the ZooKeeper container 
CPU was limited to 1 core while the host has more than 50 cores.
 # When the ZooKeeper container is limited to 1 CPU core (on a host with more than 
50 cores), due to a Docker issue the ZooKeeper JVM running in the container ends up 
with more than 38 GC worker threads (according to the JVM documentation, 1 CPU core 
should correspond to 2 GC worker threads). These GC worker threads may slow down 
ZooKeeper and may cause the Kafka-ZooKeeper session to time out.
 # After we set the ZooKeeper JVM flag *-XX:ParallelGCThreads=4*, this issue has not 
occurred again, but I think this optimization only reduces the probability of the 
problem; it does not solve it.

Based on the above, I think that when this problem occurs, ZooKeeper is not running 
well and the Kafka cluster deadlocks; after some time ZooKeeper recovers, but Kafka 
does not.


was (Author: lbdai3190):
[~ijuma] I have tried to reproduce this but failed; it reappears a few days after I 
restart the Kafka cluster.

The following info may be helpful:
 # We are running the Kafka cluster in Docker containers managed by Kubernetes.
 # Until now, every time this issue occurred, we found that the ZooKeeper container 
CPU was limited to 1 core while the host has more than 50 cores.
 # When the ZooKeeper container is limited to 1 CPU core (on a host with more than 
50 cores), due to a Docker issue the ZooKeeper JVM running in the container ends up 
with more than 38 GC worker threads (according to the JVM documentation, 1 CPU core 
should correspond to 2 GC worker threads). These GC worker threads may slow down 
ZooKeeper and may cause the Kafka-ZooKeeper session to time out.
 # After we set the ZooKeeper JVM flag *-XX:ParallelGCThreads=4*, this issue has not 
occurred again, but I think this optimization only reduces the probability of the 
problem; it does not solve it.

Based on the above, I think that when this problem occurs, ZooKeeper is not running 
well and the Kafka cluster deadlocks; after some time ZooKeeper recovers, but Kafka 
does not.

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler0 thread. When this issue occurs, the only way to 
> recover the Kafka cluster is to restart the Kafka server. The following is the 
> jstack log of the controller-event-thread and zk-session-expiry-handler0 thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Ite

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-06-16 Thread leibo (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865266#comment-16865266
 ] 

leibo edited comment on KAFKA-8532 at 6/17/19 2:53 AM:
---

[~ijuma] I have tried to reproduce this but failed; it reappears a few days after I 
restart the Kafka cluster.

The following info may be helpful:
 # We are running the Kafka cluster in Docker containers managed by Kubernetes.
 # Until now, every time this issue occurred, we found that the ZooKeeper container 
CPU was limited to 1 core while the host has more than 50 cores.
 # When the ZooKeeper container is limited to 1 CPU core (on a host with more than 
50 cores), due to a Docker issue the ZooKeeper JVM running in the container ends up 
with more than 38 GC worker threads (according to the JVM documentation, 1 CPU core 
should correspond to 2 GC worker threads). These GC worker threads may slow down 
ZooKeeper and may cause the Kafka-ZooKeeper session to time out.
 # After we set the ZooKeeper JVM flag *-XX:ParallelGCThreads=4*, this issue has not 
occurred again, but I think this optimization only reduces the probability of the 
problem; it does not solve it.

Based on the above, I think that when this problem occurs, ZooKeeper is not running 
well and the Kafka cluster deadlocks; after some time ZooKeeper recovers, but Kafka 
does not.


was (Author: lbdai3190):
[~ijuma] I have tried to reproduce this but failed; it reappears a few days after I 
restart the Kafka cluster.

The following info may be helpful:
 # We are running the Kafka cluster in Docker containers managed by Kubernetes.
 # Until now, every time this issue occurred, we found that the ZooKeeper container 
CPU was limited to 1 core while the host has more than 50 cores.
 # When the ZooKeeper container is limited to 1 CPU core (on a host with more than 
50 cores), due to a Docker issue the ZooKeeper JVM running in the container ends up 
with more than 38 GC worker threads (according to the JVM documentation, 1 CPU core 
should correspond to 2 GC worker threads). These GC worker threads may slow down 
ZooKeeper and may cause the Kafka-ZooKeeper session to time out.
 # After we set the ZooKeeper JVM flag *-XX:ParallelGCThreads=4*, this issue has not 
occurred again, but I think this optimization only reduces the probability of the 
problem; it does not solve it.

Based on the above, I think that when this problem occurs, ZooKeeper is not running 
well and the Kafka cluster deadlocks; after some time ZooKeeper recovers, but Kafka 
does not.

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler0 thread. When this issue occurs, the only way to 
> recover the Kafka cluster is to restart the Kafka server. The following is the 
> jstack log of the controller-event-thread and zk-session-expiry-handler0 thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Ite

[jira] [Commented] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-06-16 Thread leibo (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865266#comment-16865266
 ] 

leibo commented on KAFKA-8532:
--

[~ijuma] I have tried to reproduce this but failed; it reappears a few days after I 
restart the Kafka cluster.

The following info may be helpful:
 # We are running the Kafka cluster in Docker containers managed by Kubernetes.
 # Until now, every time this issue occurred, we found that the ZooKeeper container 
CPU was limited to 1 core while the host has more than 50 cores.
 # When the ZooKeeper container is limited to 1 CPU core (on a host with more than 
50 cores), due to a Docker issue the ZooKeeper JVM running in the container ends up 
with more than 38 GC worker threads (according to the JVM documentation, 1 CPU core 
should correspond to 2 GC worker threads). These GC worker threads may slow down 
ZooKeeper and may cause the Kafka-ZooKeeper session to time out.
 # After we set the ZooKeeper JVM flag *-XX:ParallelGCThreads=4*, this issue has not 
occurred again, but I think this optimization only reduces the probability of the 
problem; it does not solve it.

Based on the above, I think that when this problem occurs, ZooKeeper is not running 
well and the Kafka cluster deadlocks; after some time ZooKeeper recovers, but Kafka 
does not.

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler0 thread. When this issue occurs, the only way to 
> recover the Kafka cluster is to restart the Kafka server. The following is the 
> jstack log of the controller-event-thread and zk-session-expiry-handler0 thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:7

[jira] [Commented] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-06-13 Thread leibo (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863673#comment-16863673
 ] 

leibo commented on KAFKA-8532:
--

[~junrao] Thank you for your reply.

Because we needed to recover the Kafka cluster, we restarted the Kafka servers.

The following is my analysis from today:

I noticed that when the connection to the ZooKeeper server fails, close() is 
called and the exception is rethrown immediately, as in the following code:
{code:java}
// code placeholder
expiryScheduler.startup()
try waitUntilConnected(connectionTimeoutMs, TimeUnit.MILLISECONDS)
catch {
  case e: Throwable =>
close()
throw e
}{code}
close()
{code:java}
// code placeholder
def close(): Unit = {
  info("Closing.")
  inWriteLock(initializationLock) {
zNodeChangeHandlers.clear()
zNodeChildChangeHandlers.clear()
stateChangeHandlers.clear()
zooKeeper.close()
metricNames.foreach(removeMetric(_))
  }
  // Shutdown scheduler outside of lock to avoid deadlock if scheduler
  // is waiting for lock to process session expiry
  expiryScheduler.shutdown()
  info("Closed.")
}
{code}
expiryScheduler.shutdown() 
{code:java}
// code placeholder
override def shutdown() {
  debug("Shutting down task scheduler.")
  // We use the local variable to avoid NullPointerException if another thread 
shuts down scheduler at same time.
  val cachedExecutor = this.executor
  if (cachedExecutor != null) {
this synchronized {
  cachedExecutor.shutdown()
  this.executor = null
}
cachedExecutor.awaitTermination(1, TimeUnit.DAYS)
  }
}
{code}
We can see that expiryScheduler.shutdown() may wait for up to *1 day*, which is too 
long.

I remember that there were no ZooKeeper connection or heartbeat logs in server.log 
for a whole day when the issue occurred.

I wonder: if shutdown() keeps waiting for 1 day, will the ZooKeeper instance never 
be reinitialized during that time?
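To make the 1-day bound concrete, here is a minimal stand-alone Java sketch 
(illustrative only; the task here just blocks on a latch and is not Kafka code) 
showing that ExecutorService.awaitTermination blocks the caller until either the 
submitted task finishes or the timeout elapses, which is why a timeout of 
TimeUnit.DAYS can keep the shutdown path stuck for a whole day if a task never 
completes:
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AwaitTerminationDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();

        // Simulate a scheduled task that never finishes (e.g. blocked waiting
        // for another thread, like the session-expiry handler in the jstack).
        CountDownLatch neverCountedDown = new CountDownLatch(1);
        executor.submit(() -> {
            try {
                neverCountedDown.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        executor.shutdown(); // stop accepting new tasks; the running task keeps going

        // With awaitTermination(1, TimeUnit.DAYS) the caller could block here
        // for a whole day; 2 seconds is used so the demo finishes quickly.
        boolean finished = executor.awaitTermination(2, TimeUnit.SECONDS);
        System.out.println("terminated in time? " + finished); // prints false

        executor.shutdownNow(); // interrupt the stuck task so the JVM can exit
    }
}
{code}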

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler0 thread. When this issue occurs, the only way to 
> recover the Kafka cluster is to restart the Kafka server. The following is the 
> jstack log of the controller-event-thread and zk-session-expiry-handler0 thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-06-13 Thread leibo (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862668#comment-16862668
 ] 

leibo edited comment on KAFKA-8532 at 6/14/19 2:59 AM:
---

Here is my analysis based on the jstack log and the source code:

When the deadlock occurred:
 # The zk-session-expiry-handler0 thread was reinitializing a new ZooKeeper 
instance; the expire event had been put into the ControllerEventManager queue and 
was waiting to be handled by the controller-event-thread. (*At that time, Kafka was 
still disconnected from ZooKeeper.*)
 # However, at the same time, the controller-event-thread was handling 
RegisterBrokerAndReelect (a type of ControllerEvent), as follows:

{code:java}
//KafkaController.scala
case object RegisterBrokerAndReelect extends ControllerEvent {
  override def state: ControllerState = ControllerState.ControllerChange

  override def process(): Unit = {
zkClient.registerBroker(brokerInfo)
Reelect.process()
  }
}
{code}
While processing this event, it is blocked in ZooKeeperClient.handleRequests, whose 
code is as follows:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}
{code}
At this time, the ZooKeeper server is running normally and all of its functions 
are working well.

I suspect the key point is the *initializationLock*: if inReadLock succeeded, an 
exception would be thrown once Kafka is disconnected from ZooKeeper, but we never 
see any such exception in the Kafka server.log.

The controller-event-thread may be unable to enter inReadLock(initializationLock) 
because the write lock has not been released.

There are two places that take inWriteLock(initializationLock):

 1.  close():
{code:java}
expiryScheduler.startup()
try waitUntilConnected(connectionTimeoutMs, TimeUnit.MILLISECONDS)
catch {
  case e: Throwable =>
close() // when close() is called here, the write lock is acquired
throw e
}

def close(): Unit = {
  info("Closing.")
  inWriteLock(initializationLock) {
zNodeChangeHandlers.clear()
zNodeChildChangeHandlers.clear()
stateChangeHandlers.clear()
zooKeeper.close()
metricNames.foreach(removeMetric(_))
  }
  // Shutdown scheduler outside of lock to avoid deadlock if scheduler
  // is waiting for lock to process session expiry
  expiryScheduler.shutdown()
  info("Closed.")
}
{code}
2.  reinitialize()
{code:java}
private def reinitialize(): Unit = {
  // Initialization callbacks are invoked outside of the lock to avoid deadlock 
potential since their completion
  // may require additional Zookeeper requests, which will block to acquire the 
initialization lock
  stateChangeHandlers.values.foreach(callBeforeInitializingSession _)  //  
zk-session-expiry-handler0 is blocked here

  inWriteLock(initializationLock) {//  program is not running here
if (!connectionState.isAlive) {
  zooKeeper.close()
  info(s"Initializing a new session to $connectString.")
  // retry forever until ZooKeeper can be instantiated
  var connected = false
  while (!connected) {
try {
  zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, 
ZooKeeperClientWatcher)
  connected = true
} catch {
  case e: Exception =>
info("Error when recreating ZooKeeper, retrying after a short 
sleep", e)
Thread.sleep(1000)
}
  }
}
  }

  stateChangeHandlers.values.foreach(callAfterInitializingSession _)
}
{code}
So I think the problem occurs in the close() function.
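For reference, the wait cycle that the two stack traces suggest can be reproduced 
in isolation with two latches. This is only an illustrative Java sketch (the 
thread and latch names are mine, not Kafka code): each thread waits on the latch 
that only the other thread would count down, so neither ever makes progress:
{code:java}
import java.util.concurrent.CountDownLatch;

public class LatchWaitCycleDemo {
    public static void main(String[] args) throws InterruptedException {
        // Stands in for the ZooKeeper-response latch awaited in handleRequests().
        CountDownLatch responseReceived = new CountDownLatch(1);
        // Stands in for the Expire.waitUntilProcessingStarted() latch.
        CountDownLatch expireProcessingStarted = new CountDownLatch(1);

        Thread controllerEventThread = new Thread(() -> {
            try {
                // Blocked waiting for a ZooKeeper response that never arrives,
                // so it never dequeues and processes the Expire event.
                responseReceived.await();
                expireProcessingStarted.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "controller-event-thread");

        Thread sessionExpiryHandler = new Thread(() -> {
            try {
                // Blocked waiting for the controller to start processing Expire,
                // so it never reinitializes the session that would produce responses.
                expireProcessingStarted.await();
                responseReceived.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "zk-session-expiry-handler0");

        controllerEventThread.start();
        sessionExpiryHandler.start();

        Thread.sleep(1000);
        System.out.println(controllerEventThread.getState()); // WAITING
        System.out.println(sessionExpiryHandler.getState());  // WAITING

        // Demo only: interrupt both threads so the JVM can exit.
        controllerEventThread.interrupt();
        sessionExpiryHandler.interrupt();
    }
}
{code}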

 


was (Author: lbdai3190):
Here is my analysis based on the jstack log and the source code:

When the deadlock occurred:
 # The zk-session-expiry-handler0 thread was reinitializing a new ZooKeeper 
instance; the expire event had been put into the ControllerEventManager queue and 
was waiting to be handled by the controller-event-thread. (*At that time, Kafka was 
still disconnected from ZooKeeper.*)
 # However, at the same time, the controller-event-thread was handling 
Startup (a type of ControllerEvent), as follows:

{code:java}
//KafkaController.scala
case object Startup extends ControllerEvent {

  def state = ControllerState.ControllerChange

  override def process(): Unit = {

zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
elect()
  }

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-06-13 Thread leibo (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862772#comment-16862772
 ] 

leibo edited comment on KAFKA-8532 at 6/13/19 12:20 PM:


[~te...@apache.org] I saw that you handled KAFKA-6879; I hope you can help me 
analyze this problem, thank you.


was (Author: lbdai3190):
[~te...@apache.org] As you handled KAFKA-6879, I hope you can help me analyze this 
problem, thank you.

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler0 thread. When this issue occurs, the only way to 
> recover the Kafka cluster is to restart the Kafka server. The following is the 
> jstack log of the controller-event-thread and zk-session-expiry-handler0 thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
>  - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 
> nid=0x310 waiting on condition [0x7fccb55c8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005d1be5a00> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterrup

[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-06-13 Thread leibo (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-8532:
-
Description: 
We have observed a serious deadlock between the controller-event-thread and the 
zk-session-expiry-handler0 thread. When this issue occurs, the only way to recover 
the Kafka cluster is to restart the Kafka server. The following is the jstack log 
of the controller-event-thread and zk-session-expiry-handler0 thread.

"zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
 java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for <0x0005ee3f7000> (a 
java.util.concurrent.CountDownLatch$Sync)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
 at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
waiting for the controller-event-thread to process the expire event
 at 
kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
 at 
kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
 at 
kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
 at 
kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
 at 
kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
 at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
Source)
 at scala.collection.Iterator.foreach(Iterator.scala:937)
 at scala.collection.Iterator.foreach$(Iterator.scala:937)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
 at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
 at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
 at 
kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
 at kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
Source)
 at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
 at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
Source)
 at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)

Locked ownable synchronizers:
 - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 nid=0x310 
waiting on condition [0x7fccb55c8000]
 java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for <0x0005d1be5a00> (a 
java.util.concurrent.CountDownLatch$Sync)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
 at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
 at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157)
 at kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1596)
 at 
kafka.zk.KafkaZkClient.kafka$zk$KafkaZkClient$$retryRequestUntilConnected(KafkaZkClient.scala:1589)
 at 
kafka.zk.KafkaZkClient.deletePreferredReplicaElection(KafkaZkClient.scala:989)
 at 
kafka.controller.KafkaController.removePartitionsFromPreferredReplicaElection(KafkaController.scala:873)
 at 
kafka.controller.KafkaController.kafka$controller$KafkaController$$onPreferredReplicaElection(KafkaController.scala:631)
 at 
kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:266)
 at 
kafka.controller.KafkaController.kafka$controller$KafkaController$$elect(KafkaController.scala:1221)
 at 
kafka.controller.KafkaController$Reelect$.pr

[jira] [Commented] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-06-12 Thread leibo (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862772#comment-16862772
 ] 

leibo commented on KAFKA-8532:
--

[~te...@apache.org] As you handled KAFKA-6879, I hope you can help me analyze this 
problem, thank you.

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler0 thread. When this issue occurs, the only way to 
> recover the Kafka cluster is to restart the Kafka server. The following is the 
> jstack log of the controller-event-thread and zk-session-expiry-handler0 thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
>  - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 
> nid=0x310 waiting on condition [0x7fccb55c8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005d1be5a00> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>  at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157)
> 

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-06-12 Thread leibo (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862668#comment-16862668
 ] 

leibo edited comment on KAFKA-8532 at 6/13/19 6:46 AM:
---

Here is my analysis based on the jstack log and the source code:

When the deadlock occurred:
 # The zk-session-expiry-handler0 thread was reinitializing a new ZooKeeper 
instance; the expire event had been put into the ControllerEventManager queue and 
was waiting to be handled by the controller-event-thread. (*At that time, Kafka was 
still disconnected from ZooKeeper.*)
 # However, at the same time, the controller-event-thread was handling 
Startup (a type of ControllerEvent), as follows:

{code:java}
//KafkaController.scala
case object Startup extends ControllerEvent {

  def state = ControllerState.ControllerChange

  override def process(): Unit = {

zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
elect()
  }
}
{code}
While processing this event, it is blocked in ZooKeeperClient.handleRequests, whose 
code is as follows:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}
{code}
At this time, the ZooKeeper server is running normally and all of its functions 
are working well.

I suspect the key point is the *initializationLock*: if inReadLock succeeded, an 
exception would be thrown once Kafka is disconnected from ZooKeeper, but we never 
see any such exception in the Kafka server.log.

The controller-event-thread may be unable to enter inReadLock(initializationLock) 
because the write lock has not been released.

There are two places that take inWriteLock(initializationLock):

 1.  close():
{code:java}
expiryScheduler.startup()
try waitUntilConnected(connectionTimeoutMs, TimeUnit.MILLISECONDS)
catch {
  case e: Throwable =>
close() // when close() is called here, the write lock is acquired
throw e
}

def close(): Unit = {
  info("Closing.")
  inWriteLock(initializationLock) {
zNodeChangeHandlers.clear()
zNodeChildChangeHandlers.clear()
stateChangeHandlers.clear()
zooKeeper.close()
metricNames.foreach(removeMetric(_))
  }
  // Shutdown scheduler outside of lock to avoid deadlock if scheduler
  // is waiting for lock to process session expiry
  expiryScheduler.shutdown()
  info("Closed.")
}
{code}
2.  reinitialize()
{code:java}
private def reinitialize(): Unit = {
  // Initialization callbacks are invoked outside of the lock to avoid deadlock 
potential since their completion
  // may require additional Zookeeper requests, which will block to acquire the 
initialization lock
  stateChangeHandlers.values.foreach(callBeforeInitializingSession _)  //  
zk-session-expiry-handler0 is blocked here

  inWriteLock(initializationLock) {//  program is not running here
if (!connectionState.isAlive) {
  zooKeeper.close()
  info(s"Initializing a new session to $connectString.")
  // retry forever until ZooKeeper can be instantiated
  var connected = false
  while (!connected) {
try {
  zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, 
ZooKeeperClientWatcher)
  connected = true
} catch {
  case e: Exception =>
info("Error when recreating ZooKeeper, retrying after a short 
sleep", e)
Thread.sleep(1000)
}
  }
}
  }

  stateChangeHandlers.values.foreach(callAfterInitializingSession _)
}
{code}
So I think the problem occurs in the close() function.
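To illustrate the suspicion that an unreleased write lock would keep inReadLock 
from ever succeeding, here is a minimal stand-alone Java sketch (illustrative 
only, using a plain ReentrantReadWriteLock rather than Kafka's initializationLock): 
as long as one thread holds the write lock and never releases it, any thread 
trying to acquire the read lock blocks indefinitely:
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class WriteLockBlocksReadersDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        CountDownLatch writeLockHeld = new CountDownLatch(1);

        // Stands in for a thread stuck inside a close()-like section that
        // acquired the write lock and never reaches the unlock.
        Thread writer = new Thread(() -> {
            lock.writeLock().lock();
            writeLockHeld.countDown();
            try {
                Thread.sleep(Long.MAX_VALUE); // never releases the write lock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.writeLock().unlock();
            }
        }, "close-holder");
        writer.setDaemon(true);
        writer.start();
        writeLockHeld.await();

        // Stands in for the controller-event-thread calling
        // inReadLock(initializationLock).
        Thread reader = new Thread(() -> {
            lock.readLock().lock(); // blocks while the write lock is held
            try {
                System.out.println("got read lock"); // never printed in this demo
            } finally {
                lock.readLock().unlock();
            }
        }, "controller-event-thread");
        reader.setDaemon(true);
        reader.start();

        Thread.sleep(1000);
        System.out.println("reader state: " + reader.getState()); // WAITING
    }
}
{code}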

 


was (Author: lbdai3190):
Here is my analysis based on the jstack log and the source code:

When the deadlock occurred:
 # The zk-session-expiry-handler0 thread was reinitializing a new ZooKeeper 
instance; the expire event had been put into the ControllerEventManager queue and 
was waiting to be handled by the controller-event-thread. (*At that time, Kafka was 
still disconnected from ZooKeeper.*)
 # However, at the same time, the controller-event-thread was handling 
Startup (a type of ControllerEvent), as below:

{code:java}
//KafkaController.scala
case object Startup extends ControllerEvent {

  def state = ControllerState.ControllerChange

  override def process(): Unit = {

zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
elect()
  }
}
{code}
In this process, 

[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-06-12 Thread leibo (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-8532:
-
Description: 
We have observed a serious deadlock between the controller-event-thread and the 
zk-session-expiry-handler0 thread. When this issue occurs, the only way to recover 
the Kafka cluster is to restart the Kafka server. The following is the jstack log 
of the controller-event-thread and zk-session-expiry-handler0 thread.

"zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
 java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for <0x0005ee3f7000> (a 
java.util.concurrent.CountDownLatch$Sync)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
 at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // waiting for the controller-event-thread to process the expireEvent
 at 
kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
 at 
kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
 at 
kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
 at 
kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
 at 
kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
 at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
Source)
 at scala.collection.Iterator.foreach(Iterator.scala:937)
 at scala.collection.Iterator.foreach$(Iterator.scala:937)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
 at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
 at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
 at 
kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
 at kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
Source)
 at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
 at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
Source)
 at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)

Locked ownable synchronizers:
 - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 nid=0x310 
waiting on condition [0x7fccb55c8000]
 java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for <0x0005d1be5a00> (a 
java.util.concurrent.CountDownLatch$Sync)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
 at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
 at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157)
 at kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1596)
 at 
kafka.zk.KafkaZkClient.kafka$zk$KafkaZkClient$$retryRequestUntilConnected(KafkaZkClient.scala:1589)
 at 
kafka.zk.KafkaZkClient.deletePreferredReplicaElection(KafkaZkClient.scala:989)
 at 
kafka.controller.KafkaController.removePartitionsFromPreferredReplicaElection(KafkaController.scala:873)
 at 
kafka.controller.KafkaController.kafka$controller$KafkaController$$onPreferredReplicaElection(KafkaController.scala:631)
 at 
kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:266)
 at 
kafka.controller.KafkaController.kafka$controller$KafkaController$$elect(KafkaController.scala:1221)
 at 
kafka.controller.KafkaController$Reelect$.proces

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-06-12 Thread leibo (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862668#comment-16862668
 ] 

leibo edited comment on KAFKA-8532 at 6/13/19 6:45 AM:
---

Here is some analysis of mine based on the jstack log and the source code:

When the deadlock occurred,
 # The zk-session-expiry-handler0 thread is reinitializing a new ZooKeeper instance; the expireEvent was put into the ControllerEventManager queue, waiting to be handled by the controller-event-thread. (*At this time, Kafka is still disconnected from ZooKeeper.*)
 # However, at the same time, the controller-event-thread is handling Startup (a type of ControllerEvent), as below:

{code:java}
//KafkaController.scala
case object Startup extends ControllerEvent {

  def state = ControllerState.ControllerChange

  override def process(): Unit = {

zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
elect()
  }
}
{code}
In this process, it is blocked at ZooKeeperClient.handleRequests; the code is as below:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}
{code}
At this time, the ZooKeeper server is running normally and all of its functions are working well.

I suspect the key point is the *initializationLock*: if the inReadLock acquisition succeeded, an exception would be thrown while Kafka is disconnected from ZooKeeper, but we never see any such exception in the Kafka server.log.

The controller-event-thread cannot complete inReadLock(initializationLock), perhaps because the write lock has not been released.
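
A side note on why nothing shows up in server.log: with a ReentrantReadWriteLock, a reader that arrives while the write lock is held simply parks; it does not fail, time out, or log anything. A small stand-alone illustration (plain JDK only, illustrative names, not Kafka code):
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReaderBlockedByWriter {
  public static void main(String[] args) throws InterruptedException {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    lock.writeLock().lock(); // stand-in for a close()/reinitialize() that never releases the write lock

    Thread reader = new Thread(() -> {
      lock.readLock().lock(); // parks here, just like inReadLock(initializationLock)
      try {
        System.out.println("got read lock"); // never printed while the write lock is held
      } finally {
        lock.readLock().unlock();
      }
    }, "controller-event-thread");

    reader.start();
    reader.join(TimeUnit.SECONDS.toMillis(2));
    System.out.println("reader state after 2s: " + reader.getState()); // WAITING, with no exception anywhere
  }
}
{code}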

There are two places that call inWriteLock(initializationLock):

 1.  close():
{code:java}
expiryScheduler.startup()
try waitUntilConnected(connectionTimeoutMs, TimeUnit.MILLISECONDS)
catch {
  case e: Throwable =>
close() // the write lock is acquired inside close() here
throw e
}

def close(): Unit = {
  info("Closing.")
  inWriteLock(initializationLock) {
zNodeChangeHandlers.clear()
zNodeChildChangeHandlers.clear()
stateChangeHandlers.clear()
zooKeeper.close()
metricNames.foreach(removeMetric(_))
  }
  // Shutdown scheduler outside of lock to avoid deadlock if scheduler
  // is waiting for lock to process session expiry
  expiryScheduler.shutdown()
  info("Closed.")
}
{code}
2.  reinitialize()
{code:java}
private def reinitialize(): Unit = {
  // Initialization callbacks are invoked outside of the lock to avoid deadlock 
potential since their completion
  // may require additional Zookeeper requests, which will block to acquire the 
initialization lock
  stateChangeHandlers.values.foreach(callBeforeInitializingSession _)  //  
zk-session-expiry-handler0 is blocked here

  inWriteLock(initializationLock) {//  program is not running here
if (!connectionState.isAlive) {
  zooKeeper.close()
  info(s"Initializing a new session to $connectString.")
  // retry forever until ZooKeeper can be instantiated
  var connected = false
  while (!connected) {
try {
  zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, 
ZooKeeperClientWatcher)
  connected = true
} catch {
  case e: Exception =>
info("Error when recreating ZooKeeper, retrying after a short 
sleep", e)
Thread.sleep(1000)
}
  }
}
  }

  stateChangeHandlers.values.foreach(callAfterInitializingSession _)
}
{code}
So I think the problem occurs in the close() function.

 


was (Author: lbdai3190):
Here is some analysis of mine based on the jstack log and the source code:

When the deadlock occurred,
 # The zk-session-expiry-handler0 thread is reinitializing a new ZooKeeper instance; the expireEvent was put into the ControllerEventManager queue, waiting to be handled by the controller-event-thread. (*At this time, Kafka is still disconnected from ZooKeeper.*)
 # However, at the same time, the controller-event-thread is handling Startup (a type of ControllerEvent), as below:

{code:java}
//KafkaController.scala
case object Startup extends ControllerEvent {

  def state = ControllerState.ControllerChange

  override def process(): Unit = {

zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
elect()
  }
}
{code}
In this process,  it

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-06-12 Thread leibo (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862668#comment-16862668
 ] 

leibo edited comment on KAFKA-8532 at 6/13/19 3:58 AM:
---

Here is some analysis of mine based on the jstack log and the source code:

When the deadlock occurred,
 # The zk-session-expiry-handler0 thread is reinitializing a new ZooKeeper instance; the expireEvent was put into the ControllerEventManager queue, waiting to be handled by the controller-event-thread. (*At this time, Kafka is still disconnected from ZooKeeper.*)
 # However, at the same time, the controller-event-thread is handling Startup (a type of ControllerEvent), as below:

{code:java}
//KafkaController.scala
case object Startup extends ControllerEvent {

  def state = ControllerState.ControllerChange

  override def process(): Unit = {

zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
elect()
  }
}
{code}
In this process, it is blocked at ZooKeeperClient.handleRequests; the code is as below:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}
{code}
At this time, the ZooKeeper server is running normally and all of its functions are working well.

I suspect the key point is the *initializationLock*: if the inReadLock acquisition succeeded, an exception would be thrown while Kafka is disconnected from ZooKeeper, but we never see any such exception in the Kafka server.log.

The controller-event-thread cannot complete inReadLock(initializationLock), perhaps because the write lock has not been released.

There are two places that call inWriteLock(initializationLock):

 1.  close():

 
{code:java}
def close(): Unit = {
  info("Closing.")
  inWriteLock(initializationLock) {
zNodeChangeHandlers.clear()
zNodeChildChangeHandlers.clear()
stateChangeHandlers.clear()
zooKeeper.close()
metricNames.foreach(removeMetric(_))
  }
  // Shutdown scheduler outside of lock to avoid deadlock if scheduler
  // is waiting for lock to process session expiry
  expiryScheduler.shutdown()
  info("Closed.")
}
{code}
2.  reinitialize()
{code:java}
private def reinitialize(): Unit = {
  // Initialization callbacks are invoked outside of the lock to avoid deadlock 
potential since their completion
  // may require additional Zookeeper requests, which will block to acquire the 
initialization lock
  stateChangeHandlers.values.foreach(callBeforeInitializingSession _)  //  
zk-session-expiry-handler0 is blocked here

  inWriteLock(initializationLock) {//  program is not running here
if (!connectionState.isAlive) {
  zooKeeper.close()
  info(s"Initializing a new session to $connectString.")
  // retry forever until ZooKeeper can be instantiated
  var connected = false
  while (!connected) {
try {
  zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, 
ZooKeeperClientWatcher)
  connected = true
} catch {
  case e: Exception =>
info("Error when recreating ZooKeeper, retrying after a short 
sleep", e)
Thread.sleep(1000)
}
  }
}
  }

  stateChangeHandlers.values.foreach(callAfterInitializingSession _)
}
{code}
So I think the problem occurs in the close() function.

 


was (Author: lbdai3190):
Here is some analysis of mine based on the jstack log and the source code:

When the deadlock occurred,
 # The zk-session-expiry-handler0 thread is reinitializing a new ZooKeeper instance; the expireEvent was put into the ControllerEventManager queue, waiting to be handled by the controller-event-thread. (*At this time, Kafka is still disconnected from ZooKeeper.*)
 # However, at the same time, the controller-event-thread is handling Startup (a type of ControllerEvent), as below:

{code:java}
//KafkaController.scala
case object Startup extends ControllerEvent {

  def state = ControllerState.ControllerChange

  override def process(): Unit = {

zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
elect()
  }
}
{code}
In this process, it is blocked at ZooKeeperClient.handleRequests; the code is as below:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-06-12 Thread leibo (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862668#comment-16862668
 ] 

leibo edited comment on KAFKA-8532 at 6/13/19 4:00 AM:
---

Here is some analysis of mine based on the jstack log and the source code:

When the deadlock occurred,
 # The zk-session-expiry-handler0 thread is reinitializing a new ZooKeeper instance; the expireEvent was put into the ControllerEventManager queue, waiting to be handled by the controller-event-thread. (*At this time, Kafka is still disconnected from ZooKeeper.*)
 # However, at the same time, the controller-event-thread is handling Startup (a type of ControllerEvent), as below:

{code:java}
//KafkaController.scala
case object Startup extends ControllerEvent {

  def state = ControllerState.ControllerChange

  override def process(): Unit = {

zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
elect()
  }
}
{code}
In this process, it is blocked at ZooKeeperClient.handleRequests; the code is as below:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}
{code}
At this time, the ZooKeeper server is running normally and all of its functions are working well.

I suspect the key point is the *initializationLock*: if the inReadLock acquisition succeeded, an exception would be thrown while Kafka is disconnected from ZooKeeper, but we never see any such exception in the Kafka server.log.

The controller-event-thread cannot complete inReadLock(initializationLock), perhaps because the write lock has not been released.

There are two places that call inWriteLock(initializationLock):

 1.  close():
{code:java}
expiryScheduler.startup()
try waitUntilConnected(connectionTimeoutMs, TimeUnit.MILLISECONDS)
catch {
  case e: Throwable =>
close() // the write lock is acquired inside close() here
throw e
}

def close(): Unit = {
  info("Closing.")
  inWriteLock(initializationLock) {
zNodeChangeHandlers.clear()
zNodeChildChangeHandlers.clear()
stateChangeHandlers.clear()
zooKeeper.close()
metricNames.foreach(removeMetric(_))
  }
  // Shutdown scheduler outside of lock to avoid deadlock if scheduler
  // is waiting for lock to process session expiry
  expiryScheduler.shutdown()
  info("Closed.")
}
{code}
2.  reinitialize()
{code:java}
private def reinitialize(): Unit = {
  // Initialization callbacks are invoked outside of the lock to avoid deadlock 
potential since their completion
  // may require additional Zookeeper requests, which will block to acquire the 
initialization lock
  stateChangeHandlers.values.foreach(callBeforeInitializingSession _)  //  
zk-session-expiry-handler0 is blocked here

  inWriteLock(initializationLock) {//  program is not running here
if (!connectionState.isAlive) {
  zooKeeper.close()
  info(s"Initializing a new session to $connectString.")
  // retry forever until ZooKeeper can be instantiated
  var connected = false
  while (!connected) {
try {
  zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, 
ZooKeeperClientWatcher)
  connected = true
} catch {
  case e: Exception =>
info("Error when recreating ZooKeeper, retrying after a short 
sleep", e)
Thread.sleep(1000)
}
  }
}
  }

  stateChangeHandlers.values.foreach(callAfterInitializingSession _)
}
{code}
So I think the problem occurs in the close() function.

 


was (Author: lbdai3190):
Here is some analysis of mine based on the jstack log and the source code:

When the deadlock occurred,
 # The zk-session-expiry-handler0 thread is reinitializing a new ZooKeeper instance; the expireEvent was put into the ControllerEventManager queue, waiting to be handled by the controller-event-thread. (*At this time, Kafka is still disconnected from ZooKeeper.*)
 # However, at the same time, the controller-event-thread is handling Startup (a type of ControllerEvent), as below:

{code:java}
//KafkaController.scala
case object Startup extends ControllerEvent {

  def state = ControllerState.ControllerChange

  override def process(): Unit = {

zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
elect()
  }
}
{code}
In this process,  it 

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-06-12 Thread leibo (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862668#comment-16862668
 ] 

leibo edited comment on KAFKA-8532 at 6/13/19 3:58 AM:
---

Here is some analysis of mine based on the jstack log and the source code:

When the deadlock occurred,
 # The zk-session-expiry-handler0 thread is reinitializing a new ZooKeeper instance; the expireEvent was put into the ControllerEventManager queue, waiting to be handled by the controller-event-thread. (*At this time, Kafka is still disconnected from ZooKeeper.*)
 # However, at the same time, the controller-event-thread is handling Startup (a type of ControllerEvent), as below:

{code:java}
//KafkaController.scala
case object Startup extends ControllerEvent {

  def state = ControllerState.ControllerChange

  override def process(): Unit = {

zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
elect()
  }
}
{code}
In this process, it is blocked at ZooKeeperClient.handleRequests; the code is as below:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}
{code}
At this time, the ZooKeeper server is running normally and all of its functions are working well.

I suspect the key point is the *initializationLock*: if the inReadLock acquisition succeeded, an exception would be thrown while Kafka is disconnected from ZooKeeper, but we never see any such exception in the Kafka server.log.

The controller-event-thread cannot complete inReadLock(initializationLock), perhaps because the write lock has not been released.

There are two places that call inWriteLock(initializationLock):

 1.  close():

 
{code:java}
def close(): Unit = {
  info("Closing.")
  inWriteLock(initializationLock) {
zNodeChangeHandlers.clear()
zNodeChildChangeHandlers.clear()
stateChangeHandlers.clear()
zooKeeper.close()
metricNames.foreach(removeMetric(_))
  }
  // Shutdown scheduler outside of lock to avoid deadlock if scheduler
  // is waiting for lock to process session expiry
  expiryScheduler.shutdown()
  info("Closed.")
}
{code}
2.  reinitialize()

 

 
{code:java}
private def reinitialize(): Unit = {
  // Initialization callbacks are invoked outside of the lock to avoid deadlock 
potential since their completion
  // may require additional Zookeeper requests, which will block to acquire the 
initialization lock
  stateChangeHandlers.values.foreach(callBeforeInitializingSession _)  //  
zk-session-expiry-handler0 is blocked here

  inWriteLock(initializationLock) {//  program is not running here
if (!connectionState.isAlive) {
  zooKeeper.close()
  info(s"Initializing a new session to $connectString.")
  // retry forever until ZooKeeper can be instantiated
  var connected = false
  while (!connected) {
try {
  zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, 
ZooKeeperClientWatcher)
  connected = true
} catch {
  case e: Exception =>
info("Error when recreating ZooKeeper, retrying after a short 
sleep", e)
Thread.sleep(1000)
}
  }
}
  }

  stateChangeHandlers.values.foreach(callAfterInitializingSession _)
}
{code}
So I think the problem occurs in the close() function.

 


was (Author: lbdai3190):
Here is some analysis of mine based on the jstack log and the source code:

When the deadlock occurred,
 # The zk-session-expiry-handler0 thread is reinitializing a new ZooKeeper instance; the expireEvent was put into the ControllerEventManager queue, waiting to be handled by the controller-event-thread. (*At this time, Kafka is still disconnected from ZooKeeper.*)
 # However, at the same time, the controller-event-thread is handling Startup (a type of ControllerEvent), as below:

{code:java}
//KafkaController.scala
case object Startup extends ControllerEvent {

  def state = ControllerState.ControllerChange

  override def process(): Unit = {

zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
elect()
  }
}
{code}
In this process, it is blocked at ZooKeeperClient.handleRequests; the code is as below:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-06-12 Thread leibo (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862668#comment-16862668
 ] 

leibo edited comment on KAFKA-8532 at 6/13/19 3:58 AM:
---

Here is some analysis of mine based on the jstack log and the source code:

When the deadlock occurred,
 # The zk-session-expiry-handler0 thread is reinitializing a new ZooKeeper instance; the expireEvent was put into the ControllerEventManager queue, waiting to be handled by the controller-event-thread. (*At this time, Kafka is still disconnected from ZooKeeper.*)
 # However, at the same time, the controller-event-thread is handling Startup (a type of ControllerEvent), as below:

{code:java}
//KafkaController.scala
case object Startup extends ControllerEvent {

  def state = ControllerState.ControllerChange

  override def process(): Unit = {

zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
elect()
  }
}
{code}
In this process, it is blocked at ZooKeeperClient.handleRequests; the code is as below:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}
{code}
At this time, the ZooKeeper server is running normally and all of its functions are working well.

I suspect the key point is the *initializationLock*: if the inReadLock acquisition succeeded, an exception would be thrown while Kafka is disconnected from ZooKeeper, but we never see any such exception in the Kafka server.log.

The controller-event-thread cannot complete inReadLock(initializationLock), perhaps because the write lock has not been released.

There are two places that call inWriteLock(initializationLock):

 1.  close():
{code:java}
def close(): Unit = {
  info("Closing.")
  inWriteLock(initializationLock) {
zNodeChangeHandlers.clear()
zNodeChildChangeHandlers.clear()
stateChangeHandlers.clear()
zooKeeper.close()
metricNames.foreach(removeMetric(_))
  }
  // Shutdown scheduler outside of lock to avoid deadlock if scheduler
  // is waiting for lock to process session expiry
  expiryScheduler.shutdown()
  info("Closed.")
}
{code}
2.  reinitialize()
{code:java}
private def reinitialize(): Unit = {
  // Initialization callbacks are invoked outside of the lock to avoid deadlock 
potential since their completion
  // may require additional Zookeeper requests, which will block to acquire the 
initialization lock
  stateChangeHandlers.values.foreach(callBeforeInitializingSession _)  //  
zk-session-expiry-handler0 is blocked here

  inWriteLock(initializationLock) {//  program is not running here
if (!connectionState.isAlive) {
  zooKeeper.close()
  info(s"Initializing a new session to $connectString.")
  // retry forever until ZooKeeper can be instantiated
  var connected = false
  while (!connected) {
try {
  zooKeeper = new ZooKeeper(connectString, sessionTimeoutMs, 
ZooKeeperClientWatcher)
  connected = true
} catch {
  case e: Exception =>
info("Error when recreating ZooKeeper, retrying after a short 
sleep", e)
Thread.sleep(1000)
}
  }
}
  }

  stateChangeHandlers.values.foreach(callAfterInitializingSession _)
}
{code}
So I think the problem occurs in the close() function.

 


was (Author: lbdai3190):
Here is some analysis of mine based on the jstack log and the source code:

When the deadlock occurred,
 # The zk-session-expiry-handler0 thread is reinitializing a new ZooKeeper instance; the expireEvent was put into the ControllerEventManager queue, waiting to be handled by the controller-event-thread. (*At this time, Kafka is still disconnected from ZooKeeper.*)
 # However, at the same time, the controller-event-thread is handling Startup (a type of ControllerEvent), as below:

{code:java}
//KafkaController.scala
case object Startup extends ControllerEvent {

  def state = ControllerState.ControllerChange

  override def process(): Unit = {

zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
elect()
  }
}
{code}
In this process, it is blocked at ZooKeeperClient.handleRequests; the code is as below:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (req

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-06-12 Thread leibo (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862668#comment-16862668
 ] 

leibo edited comment on KAFKA-8532 at 6/13/19 3:34 AM:
---

Here is some analysis of mine based on the jstack log and the source code:

When the deadlock occurred,
 # The zk-session-expiry-handler0 thread is reinitializing a new ZooKeeper instance; the expireEvent was put into the ControllerEventManager queue, waiting to be handled by the controller-event-thread. (*At this time, Kafka is still disconnected from ZooKeeper.*)
 # However, at the same time, the controller-event-thread is handling Startup (a type of ControllerEvent), as below:

{code:java}
//KafkaController.scala
case object Startup extends ControllerEvent {

  def state = ControllerState.ControllerChange

  override def process(): Unit = {

zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
elect()
  }
}
{code}
In this process, it is blocked at ZooKeeperClient.handleRequests; the code is as below:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}
{code}
At this time, the ZooKeeper server is running normally and all of its functions are working well.

I suspect the key point is the *initializationLock*: if the inReadLock acquisition succeeded, an exception would be thrown while Kafka is disconnected from ZooKeeper, but we never see any such exception in the Kafka server.log.


was (Author: lbdai3190):
Here is some analysis of mine based on the jstack log and the source code:

When the deadlock occurred,
 # The zk-session-expiry-handler0 thread is reinitializing a new ZooKeeper instance; the expireEvent was put into the ControllerEventManager queue, waiting to be handled by the controller-event-thread. (*At this time, Kafka is still disconnected from ZooKeeper.*)
 # However, at the same time, the controller-event-thread is handling Startup (a type of ControllerEvent), as below:

{code:java}
//KafkaController.scala
case object Startup extends ControllerEvent {

  def state = ControllerState.ControllerChange

  override def process(): Unit = {

zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
elect()
  }
}
{code}
In this process, it is blocked at ZooKeeperClient.handleRequests; the code is as below:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}
{code}
At this time, the ZooKeeper server is running normally and all of its functions are working well.

I suspect the key point is the *initializationLock*: if the inReadLock acquisition succeeded, an exception would be thrown while Kafka is disconnected from ZooKeeper, but we never see any such exception in the Kafka server.log.

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to recover 
> the Kafka cluster is to restart the Kafka servers. Below is the jstack log of the 
> controller-event-thread and the zk-session-expiry-handler thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22

[jira] [Comment Edited] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-06-12 Thread leibo (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862668#comment-16862668
 ] 

leibo edited comment on KAFKA-8532 at 6/13/19 3:31 AM:
---

Here is some analysis of mine based on the jstack log and the source code:

When the deadlock occurred,
 # The zk-session-expiry-handler0 thread is reinitializing a new ZooKeeper instance; the expireEvent was put into the ControllerEventManager queue, waiting to be handled by the controller-event-thread. (*At this time, Kafka is still disconnected from ZooKeeper.*)
 # However, at the same time, the controller-event-thread is handling Startup (a type of ControllerEvent), as below:

{code:java}
//KafkaController.scala
case object Startup extends ControllerEvent {

  def state = ControllerState.ControllerChange

  override def process(): Unit = {

zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
elect()
  }
}
{code}
In this process, it is blocked at ZooKeeperClient.handleRequests; the code is as below:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}
{code}
At this time, the ZooKeeper server is running normally and all of its functions are working well.

I suspect the key point is the *initializationLock*: if the inReadLock acquisition succeeded, an exception would be thrown while Kafka is disconnected from ZooKeeper, but we never see any such exception in the Kafka server.log.


was (Author: lbdai3190):
Here is some analysis of mine based on the jstack log and the source code:

When the deadlock occurred,
 # The zk-session-expiry-handler0 thread is reinitializing a new ZooKeeper instance; the expireEvent was put into the ControllerEventManager queue, waiting to be handled by the controller-event-thread. (*At this time, Kafka is still disconnected from ZooKeeper.*)
 # However, at the same time, the controller-event-thread is handling Startup (a type of ControllerEvent), as below:

{code:java}
//KafkaController.scala
case object Startup extends ControllerEvent {

  def state = ControllerState.ControllerChange

  override def process(): Unit = {

zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
elect()
  }
}
{code}
In this process, it is blocked at ZooKeeperClient.handleRequests; the code is as below:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}

{code}

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to recover 
> the Kafka cluster is to restart the Kafka servers. Below is the jstack log of the 
> controller-event-thread and the zk-session-expiry-handler thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  a

[jira] [Commented] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-06-12 Thread leibo (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862668#comment-16862668
 ] 

leibo commented on KAFKA-8532:
--

Here is some analysis of mine based on the jstack log and the source code:

When the deadlock occurred,
 # The zk-session-expiry-handler0 thread is reinitializing a new ZooKeeper instance; the expireEvent was put into the ControllerEventManager queue, waiting to be handled by the controller-event-thread. (*At this time, Kafka is still disconnected from ZooKeeper.*)
 # However, at the same time, the controller-event-thread is handling Startup (a type of ControllerEvent), as below:

{code:java}
//KafkaController.scala
case object Startup extends ControllerEvent {

  def state = ControllerState.ControllerChange

  override def process(): Unit = {

zkClient.registerZNodeChangeHandlerAndCheckExistence(controllerChangeHandler)
elect()
  }
}
{code}
In this process, it is blocked at ZooKeeperClient.handleRequests; the code is as below:
{code:java}
// code placeholder
def handleRequests[Req <: AsyncRequest](requests: Seq[Req]): Seq[Req#Response] 
= {
  if (requests.isEmpty)
Seq.empty
  else {
val countDownLatch = new CountDownLatch(requests.size)
val responseQueue = new ArrayBlockingQueue[Req#Response](requests.size)

requests.foreach { request =>
  inFlightRequests.acquire()
  try {
inReadLock(initializationLock) {
  send(request) { response =>
responseQueue.add(response)
inFlightRequests.release()
countDownLatch.countDown()
  }
}
  } catch {
case e: Throwable =>
  inFlightRequests.release()
  throw e
  }
}
countDownLatch.await()
responseQueue.asScala.toBuffer
  }
}

{code}

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to recover 
> the Kafka cluster is to restart the Kafka servers. Below is the jstack log of the 
> controller-event-thread and the zk-session-expiry-handler thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // waiting for the controller-event-thread to process the expireEvent
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:

[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-06-12 Thread leibo (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-8532:
-
Attachment: js.log

> controller-event-thread deadlock with zk-session-expiry-handler0
> 
>
> Key: KAFKA-8532
> URL: https://issues.apache.org/jira/browse/KAFKA-8532
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
>Reporter: leibo
>Priority: Blocker
> Attachments: js.log
>
>
> We have observed a serious deadlock between the controller-event-thread and the 
> zk-session-expiry-handler thread. When this issue occurs, the only way to recover 
> the Kafka cluster is to restart the Kafka servers. Below is the jstack log of the 
> controller-event-thread and the zk-session-expiry-handler thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // waiting for the controller-event-thread to process the expireEvent
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
>  - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 
> nid=0x310 waiting on condition [0x7fccb55c8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x0005d1be5a00> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>  at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157)
>  at 
> kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1596)
>  at 
> kafka.zk.KafkaZkClient.kafka$zk$KafkaZkClient$$retryRequestUnti

[jira] [Updated] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-06-12 Thread leibo (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-8532:
-
Description: 
We have observed a serious deadlock between the controller-event-thread and the 
zk-session-expiry-handler thread. When this issue occurs, the only way to recover 
the Kafka cluster is to restart the Kafka servers. Below is the jstack log of the 
controller-event-thread and the zk-session-expiry-handler thread.

"zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
tid=0x7fcc9c01 nid=0xfb22 waiting on condition [0x7fcbb01f8000]
 java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for <0x0005ee3f7000> (a 
java.util.concurrent.CountDownLatch$Sync)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
 at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // waiting for the controller-event-thread to process the expireEvent
 at 
kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
 at 
kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
 at 
kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
 at 
kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
 at 
kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
 at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
Source)
 at scala.collection.Iterator.foreach(Iterator.scala:937)
 at scala.collection.Iterator.foreach$(Iterator.scala:937)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
 at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
 at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
 at 
kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
 at kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
Source)
 at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
 at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
Source)
 at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)

Locked ownable synchronizers:
 - <0x000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"controller-event-thread" #51 prio=5 os_prio=0 tid=0x7fceaeec4000 nid=0x310 
waiting on condition [0x7fccb55c8000]
 java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for <0x0005d1be5a00> (a 
java.util.concurrent.CountDownLatch$Sync)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
 at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
 at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
 at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157)
 at kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1596)
 at 
kafka.zk.KafkaZkClient.kafka$zk$KafkaZkClient$$retryRequestUntilConnected(KafkaZkClient.scala:1589)
 at 
kafka.zk.KafkaZkClient.deletePreferredReplicaElection(KafkaZkClient.scala:989)
 at 
kafka.controller.KafkaController.removePartitionsFromPreferredReplicaElection(KafkaController.scala:873)
 at 
kafka.controller.KafkaController.kafka$controller$KafkaController$$onPreferredReplicaElection(KafkaController.scala:631)
 at 
kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:266)
 at 
kafka.controller.KafkaController.kafka$controller$KafkaController$$elect(KafkaController.scala:1221)
 at 
kafka.controller.KafkaController$Reelect$.process(Kafk

[jira] [Created] (KAFKA-8532) controller-event-thread deadlock with zk-session-expiry-handler0

2019-06-12 Thread leibo (JIRA)
leibo created KAFKA-8532:


 Summary: controller-event-thread deadlock with 
zk-session-expiry-handler0
 Key: KAFKA-8532
 URL: https://issues.apache.org/jira/browse/KAFKA-8532
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 2.1.1
Reporter: leibo






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KAFKA-5998) /.checkpoint.tmp Not found exception

2019-03-04 Thread leibo (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784126#comment-16784126
 ] 

leibo edited comment on KAFKA-5998 at 3/5/19 6:57 AM:
--

It seems that {color:#ff}final File temp = new File(file.getAbsolutePath() + ".tmp");{color} does not always create a new file on some systems. Can we do the following before writing the offsets to the .checkpoint.tmp file?
{code:java}
// code placeholder
final File temp = new File(file.getAbsolutePath() + ".tmp");
if (!temp.exists()) {       // create the .checkpoint.tmp file if it does not exist
    temp.createNewFile();
}
final FileOutputStream fileOutputStream = new FileOutputStream(temp);
{code}


was (Author: lbdai3190):
It seems that {color:#FF}final File temp = new File(file.getAbsolutePath() + ".tmp");{color} does not create a new file on some systems.
{code:java}
// code placeholder
final File temp = new File(file.getAbsolutePath() + ".tmp");
if (!temp.exists()) {       // create the .checkpoint.tmp file if it does not exist
    temp.createNewFile();
}
final FileOutputStream fileOutputStream = new FileOutputStream(temp);
{code}

> /.checkpoint.tmp Not found exception
> 
>
> Key: KAFKA-5998
> URL: https://issues.apache.org/jira/browse/KAFKA-5998
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.11.0.0, 0.11.0.1, 2.1.1
>Reporter: Yogesh BG
>Priority: Major
> Attachments: 5998.v1.txt, 5998.v2.txt, Topology.txt, exc.txt, 
> props.txt, streams.txt
>
>
> I have one Kafka broker and one Kafka Streams instance running... They have been 
> running for two days under a load of around 2500 msgs per second. On the third day 
> I am getting the below exception for some of the partitions; I have 16 partitions 
> and only 0_0 and 0_1 give this error
> {{09:43:25.955 [ks_0_inst-StreamThread-6] WARN  
> o.a.k.s.p.i.ProcessorStateManager - Failed to write checkpoint file to 
> /data/kstreams/rtp-kafkastreams/0_1/.checkpoint:
> java.io.FileNotFoundException: 
> /data/kstreams/rtp-kafkastreams/0_1/.checkpoint.tmp (No such file or 
> directory)
> at java.io.FileOutputStream.open(Native Method) ~[na:1.7.0_111]
> at java.io.FileOutputStream.<init>(FileOutputStream.java:221) 
> ~[na:1.7.0_111]
> at java.io.FileOutputStream.<init>(FileOutputStream.java:171) 
> ~[na:1.7.0_111]
> at 
> org.apache.kafka.streams.state.internals.OffsetCheckpoint.write(OffsetCheckpoint.java:73)
>  ~[rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.ProcessorStateManager.checkpoint(ProcessorStateManager.java:324)
>  ~[rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:267)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:201)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:260)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:254)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.AssignedTasks$1.apply(AssignedTasks.java:322)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.AssignedTasks.applyToRunningTasks(AssignedTasks.java:415)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.AssignedTasks.commit(AssignedTasks.java:314)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:700)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:683)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:523)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:480)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:457)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> 09:43:25.974 [ks_0_inst-StreamThread-15] WARN  
> o.a.k.s.p.i.ProcessorStateManager - Failed to wr

[jira] [Commented] (KAFKA-5998) /.checkpoint.tmp Not found exception

2019-03-04 Thread leibo (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784126#comment-16784126
 ] 

leibo commented on KAFKA-5998:
--

It seems that {color:#FF}final File temp = new File(file.getAbsolutePath() + ".tmp");{color} 
will not create a new file on some systems (constructing a File only builds a path; the file 
itself is not created on disk).
{code:java}
// code placeholder
final File temp = new File(file.getAbsolutePath() + ".tmp");
if (!temp.exists()) {       // create the .checkpoint.tmp file if it does not exist
    temp.createNewFile();
}
final FileOutputStream fileOutputStream = new FileOutputStream(temp);
{code}

> /.checkpoint.tmp Not found exception
> 
>
> Key: KAFKA-5998
> URL: https://issues.apache.org/jira/browse/KAFKA-5998
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.11.0.0, 0.11.0.1, 2.1.1
>Reporter: Yogesh BG
>Priority: Major
> Attachments: 5998.v1.txt, 5998.v2.txt, Topology.txt, exc.txt, 
> props.txt, streams.txt
>
>
> I have one Kafka broker and one Kafka Streams application running. They have 
> been running for two days under a load of around 2500 msgs per second. On the 
> third day I started getting the exception below for some of the partitions; I 
> have 16 partitions, and only 0_0 and 0_1 give this error:
> {{09:43:25.955 [ks_0_inst-StreamThread-6] WARN  
> o.a.k.s.p.i.ProcessorStateManager - Failed to write checkpoint file to 
> /data/kstreams/rtp-kafkastreams/0_1/.checkpoint:
> java.io.FileNotFoundException: 
> /data/kstreams/rtp-kafkastreams/0_1/.checkpoint.tmp (No such file or 
> directory)
> at java.io.FileOutputStream.open(Native Method) ~[na:1.7.0_111]
> at java.io.FileOutputStream.<init>(FileOutputStream.java:221) 
> ~[na:1.7.0_111]
> at java.io.FileOutputStream.<init>(FileOutputStream.java:171) 
> ~[na:1.7.0_111]
> at 
> org.apache.kafka.streams.state.internals.OffsetCheckpoint.write(OffsetCheckpoint.java:73)
>  ~[rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.ProcessorStateManager.checkpoint(ProcessorStateManager.java:324)
>  ~[rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:267)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:201)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:260)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:254)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.AssignedTasks$1.apply(AssignedTasks.java:322)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.AssignedTasks.applyToRunningTasks(AssignedTasks.java:415)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.AssignedTasks.commit(AssignedTasks.java:314)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:700)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:683)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:523)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:480)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> at 
> org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:457)
>  [rtp-kafkastreams-1.0-SNAPSHOT-jar-with-dependencies.jar:na]
> 09:43:25.974 [ks_0_inst-StreamThread-15] WARN  
> o.a.k.s.p.i.ProcessorStateManager - Failed to write checkpoint file to 
> /data/kstreams/rtp-kafkastreams/0_0/.checkpoint:
> java.io.FileNotFoundException: 
> /data/kstreams/rtp-kafkastreams/0_0/.checkpoint.tmp (No such file or 
> directory)
> at java.io.FileOutputStream.open(Native Method) ~[na:1.7.0_111]
> at java.io.FileOutputStream.<init>(FileOutputStream.java:221) 
> ~[na:1.7.0_111]
> at java.io.FileOutputStream.<init>(FileOutputStream.java:171) 
> ~[na:1.7.0_111]
> at 
> org.apache.kafka.streams.state.internals.OffsetCheckpoint.write(OffsetCheckpoint.java:73)
>  ~[rtp-k

[jira] [Updated] (KAFKA-7684) kafka consumer SchemaException occurred: Error reading field 'brokers':

2018-11-27 Thread leibo (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leibo updated KAFKA-7684:
-
Description: 
The Kafka consumer throws org.apache.kafka.common.protocol.types.SchemaException: 
Error reading field 'brokers': Error reading array of size 65535, only 22 bytes 
available when the broker is not working due to a *long full GC*.

*Exception Detail:*

org.apache.kafka.common.protocol.types.SchemaException: Error reading field 
'brokers': Error reading array of size 65535, only 22 bytes available
 at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:75)
 at org.apache.kafka.common.protocol.ApiKeys.parseResponse(ApiKeys.java:153)
 at 
org.apache.kafka.clients.NetworkClient.parseStructMaybeUpdateThrottleTimeMetrics(NetworkClient.java:560)
 at 
org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:657)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:442)
 at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:232)
 at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:208)
 at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:184)
 at 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorReady(AbstractCoordinator.java:217)
 at 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorReady(AbstractCoordinator.java:203)
 at 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:286)
 at 
org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1078)
 at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043)
 at 
com.zte.ums.zenap.cometd.ReceiveKafkaMessage.run(ReceiveKafkaMessage.java:40)

*After this, the following exceptions occurred:*

2018-11-21 19:43:52 097 ERROR 
[org.apache.kafka.clients.consumer.KafkaConsumer][Thread-22] - Failed to close 
coordinator
java.lang.IllegalStateException: Correlation id for response (3097236) does not 
match request (3097235), request header: 
\{api_key=8,api_version=3,correlation_id=3097235,client_id=pm-task_100.100.0.43_8}
 at org.apache.kafka.clients.NetworkClient.correlate(NetworkClient.java:752)
 at 
org.apache.kafka.clients.NetworkClient.parseStructMaybeUpdateThrottleTimeMetrics(NetworkClient.java:561)
 at 
org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:657)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:442)
 at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:232)
 at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:208)
 at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:199)
 at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitPendingRequests(ConsumerNetworkClient.java:284)
 at 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator.close(AbstractCoordinator.java:706)
 at 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.close(ConsumerCoordinator.java:503)
 at 
org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1605)
 at 
org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1573)
 at 
org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1549)
 at 
com.zte.ums.zenap.pm.common.kafka.BasicKafkaConsumerWorker$1.run(BasicKafkaConsumerWorker.java:77)
 at java.lang.Thread.run(Thread.java:811)

2018-11-21 19:43:52 099 ERROR 
[com.zte.ums.zenap.dw.ext.FrameWorkThreadGroup][Thread-24] - Thread-24 throw 
this exception 
org.apache.kafka.common.KafkaException: Failed to close kafka consumer
 at 
org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1623)
 at 
org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1573)
 at 
org.apache.kafka.clients.consumer.KafkaConsumer.close(KafkaConsumer.java:1549)
 at 
com.zte.ums.zenap.pm.common.kafka.BasicKafkaConsumerWorker$1.run(BasicKafkaConsumerWorker.java:77)
 at java.lang.Thread.run(Thread.java:811)
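
A possible client-side mitigation for the close failure above (a sketch under the assumption that the application controls the consumer shutdown path; not verified against this report; the class and method names are hypothetical): bound the close call and catch KafkaException so an unresponsive coordinator cannot propagate an exception out of shutdown. KafkaConsumer.close(long, TimeUnit) is available in the 0.11.0.x client.
{code:java}
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.KafkaException;

public final class SafeConsumerClose {

    // Close with a bounded timeout and swallow close-time failures so a broker
    // stuck in a long full GC cannot break the shutdown path.
    public static void closeQuietly(final KafkaConsumer<?, ?> consumer) {
        try {
            consumer.close(5, TimeUnit.SECONDS); // bounded instead of the default 30 s
        } catch (final KafkaException e) {
            // Log and continue; the coordinator may be unreachable during the GC pause.
            System.err.println("Consumer close failed: " + e.getMessage());
        }
    }
}
{code}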

 

Kafka broker version: *0.11.0.1*

Kafka consumer version: *0.11.0.1*

*More Info:*

At the same time, the Kafka broker had a long (about 16 seconds) full GC, and 
lots of consumer rebalances and ZooKeeper session timeouts occurred.
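
Assuming the ~16 second broker pause is the trigger (an assumption; whether it relates to the SchemaException itself is not established), the generic resilience knobs on the consumer are worth reviewing so short broker stalls are ridden out rather than turning into immediate session expirations and rebalances. The values below are illustrative only, and the class name is hypothetical.
{code:java}
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;

public final class GcTolerantConsumerConfig {

    // Illustrative values only (assumptions, not a verified fix for this report).
    public static Properties build(final String bootstrapServers, final String groupId) {
        final Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 30000);     // above the ~16 s pause
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 10000);  // about 1/3 of the session timeout
        props.put(ConsumerConfig.RECONNECT_BACKOFF_MS_CONFIG, 1000);    // back off while the broker recovers
        props.put(ConsumerConfig.RETRY_BACKOFF_MS_CONFIG, 1000);
        return props;
    }
}
{code}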

 

  was:
The Kafka consumer throws org.apache.kafka.common.protocol.types.SchemaException: 
Error reading field 'brokers': Error reading array of size 65535, only 22 bytes 
available when the broker is not working due to a long full GC.

Exception Detail:

org.apache.kafka.common.protocol.types.SchemaException: Error reading field 
'brokers': Error reading array of size 65535, only 22 bytes available
 at org.apache.kafka.commo

[jira] [Created] (KAFKA-7684) kafka consumer SchemaException occurred: Error reading field 'brokers':

2018-11-27 Thread leibo (JIRA)
leibo created KAFKA-7684:


 Summary: kafka consumer SchemaException occurred: Error reading 
field 'brokers': 
 Key: KAFKA-7684
 URL: https://issues.apache.org/jira/browse/KAFKA-7684
 Project: Kafka
  Issue Type: Bug
  Components: consumer
Affects Versions: 0.11.0.1
Reporter: leibo


The Kafka consumer throws org.apache.kafka.common.protocol.types.SchemaException: 
Error reading field 'brokers': Error reading array of size 65535, only 22 bytes 
available when the broker is not working due to a long full GC.

Exception Detail:

org.apache.kafka.common.protocol.types.SchemaException: Error reading field 
'brokers': Error reading array of size 65535, only 22 bytes available
 at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:75)
 at org.apache.kafka.common.protocol.ApiKeys.parseResponse(ApiKeys.java:153)
 at 
org.apache.kafka.clients.NetworkClient.parseStructMaybeUpdateThrottleTimeMetrics(NetworkClient.java:560)
 at 
org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:657)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:442)
 at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:232)
 at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:208)
 at 
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:184)
 at 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorReady(AbstractCoordinator.java:217)
 at 
org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorReady(AbstractCoordinator.java:203)
 at 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:286)
 at 
org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1078)
 at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043)
 at 
com.zte.ums.zenap.cometd.ReceiveKafkaMessage.run(ReceiveKafkaMessage.java:40)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)