[jira] [Updated] (KAFKA-5571) Possible deadlock during shutdown in setState in kafka streams 10.2

2019-03-01 Thread Matthias J. Sax (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated KAFKA-5571:
---
Fix Version/s: (was: 1.1.0)
   1.0.0

> Possible deadlock during shutdown in setState in kafka streams 10.2
> ---
>
> Key: KAFKA-5571
> URL: https://issues.apache.org/jira/browse/KAFKA-5571
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.2.1
>Reporter: Greg Fodor
>Assignee: Eno Thereska
>Priority: Major
> Fix For: 1.0.0
>
> Attachments: kafka-streams.deadlock.log
>
>
> I'm running a 10.2 job across 5 nodes with 32 stream threads on each node and 
> find that when gracefully shutdown all of them at once via an ansible 
> scripts, some of the nodes end up freezing -- at a glance the attached thread 
> dump implies a deadlock between stream threads trying to update their state 
> via setState. We haven't had this problem before but it may or may not be 
> related to changes in 10.2 (we are upgrading from 10.0 to 10.2)
> when we gracefully shutdown all nodes simultaneously, what typically happens 
> is some subset of the nodes end up not shutting down completely but end up 
> going through a rebalance first. it seems this deadlock requires this 
> rebalancing to occur simultaneously with the graceful shutdown. if we happen 
> to shut them down and no rebalance happens, i don't believe this deadlock is 
> triggered.
> the deadlock appears related to the state change handlers being subscribed 
> across threads and the fact that both StreamThread#setState and 
> StreamStateListener#onChange are both synchronized methods.
> Another thing worth mentioning is that one of the transformers used in the 
> job has a close() method that can take 10-15 seconds to finish since it needs 
> to flush some data to a database. Having a long close() method combined with 
> a rebalance during a shutdown across many threads may be necessary for 
> reproduction.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-5571) Possible deadlock during shutdown in setState in kafka streams 10.2

2019-03-01 Thread Matthias J. Sax (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated KAFKA-5571:
---
Fix Version/s: 1.1.0

> Possible deadlock during shutdown in setState in kafka streams 10.2
> ---
>
> Key: KAFKA-5571
> URL: https://issues.apache.org/jira/browse/KAFKA-5571
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.2.1
>Reporter: Greg Fodor
>Assignee: Eno Thereska
>Priority: Major
> Fix For: 1.1.0
>
> Attachments: kafka-streams.deadlock.log
>
>
> I'm running a 10.2 job across 5 nodes with 32 stream threads on each node and 
> find that when gracefully shutdown all of them at once via an ansible 
> scripts, some of the nodes end up freezing -- at a glance the attached thread 
> dump implies a deadlock between stream threads trying to update their state 
> via setState. We haven't had this problem before but it may or may not be 
> related to changes in 10.2 (we are upgrading from 10.0 to 10.2)
> when we gracefully shutdown all nodes simultaneously, what typically happens 
> is some subset of the nodes end up not shutting down completely but end up 
> going through a rebalance first. it seems this deadlock requires this 
> rebalancing to occur simultaneously with the graceful shutdown. if we happen 
> to shut them down and no rebalance happens, i don't believe this deadlock is 
> triggered.
> the deadlock appears related to the state change handlers being subscribed 
> across threads and the fact that both StreamThread#setState and 
> StreamStateListener#onChange are both synchronized methods.
> Another thing worth mentioning is that one of the transformers used in the 
> job has a close() method that can take 10-15 seconds to finish since it needs 
> to flush some data to a database. Having a long close() method combined with 
> a rebalance during a shutdown across many threads may be necessary for 
> reproduction.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)