[jira] [Updated] (KAFKA-5571) Possible deadlock during shutdown in setState in kafka streams 10.2
[ https://issues.apache.org/jira/browse/KAFKA-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias J. Sax updated KAFKA-5571:
---
    Fix Version/s: (was: 1.1.0)
                   1.0.0

> Possible deadlock during shutdown in setState in kafka streams 10.2
> ---
>
>                 Key: KAFKA-5571
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5571
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 0.10.2.1
>            Reporter: Greg Fodor
>            Assignee: Eno Thereska
>            Priority: Major
>             Fix For: 1.0.0
>
>         Attachments: kafka-streams.deadlock.log
>
>
> I'm running a 0.10.2 job across 5 nodes with 32 stream threads on each node.
> When I gracefully shut down all of them at once via an Ansible script, some
> of the nodes end up freezing. At a glance, the attached thread dump implies a
> deadlock between stream threads trying to update their state via setState.
> We haven't had this problem before, but it may or may not be related to
> changes in 0.10.2 (we are upgrading from 0.10.0 to 0.10.2).
>
> When we gracefully shut down all nodes simultaneously, what typically happens
> is that some subset of the nodes do not shut down completely but go through a
> rebalance first. It seems this deadlock requires the rebalance to occur
> simultaneously with the graceful shutdown. If we shut the nodes down and no
> rebalance happens, I don't believe the deadlock is triggered.
>
> The deadlock appears related to the state change handlers being subscribed
> across threads and the fact that StreamThread#setState and
> StreamStateListener#onChange are both synchronized methods.
>
> Another thing worth mentioning is that one of the transformers used in the
> job has a close() method that can take 10-15 seconds to finish, since it
> needs to flush some data to a database. A long close() method combined with a
> rebalance during a shutdown across many threads may be necessary for
> reproduction.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
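The report above attributes the hang to two synchronized methods whose monitors can be taken in opposite orders by different threads. The following is a minimal sketch of that general lock-order inversion, not Kafka's actual code: `threadMonitor` and `listenerMonitor` are hypothetical stand-ins for the monitors guarding `StreamThread#setState` and `StreamStateListener#onChange`, and `ReentrantLock.tryLock` with a timeout replaces `synchronized` so that the example reports the inversion instead of hanging. Which call paths take the locks in which order in the real job is an assumption here; the attached thread dump is the authority on that.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderSketch {

    // Take `first`, wait until the other worker also holds its first lock,
    // then try to take `second` with a timeout. Returns whether `second` was acquired.
    static boolean holdThenTry(ReentrantLock first, ReentrantLock second,
                               CountDownLatch bothHolding, CountDownLatch bothTried)
            throws InterruptedException {
        first.lock();
        try {
            bothHolding.countDown();
            bothHolding.await();   // both workers now hold their first lock
            boolean acquired = second.tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                second.unlock();
            }
            bothTried.countDown();
            bothTried.await();     // keep holding `first` until both attempts are over
            return acquired;
        } finally {
            first.unlock();
        }
    }

    // Runs two workers that take the two locks in opposite orders and
    // returns how many of them managed to acquire their second lock.
    static int secondLocksAcquired() throws Exception {
        ReentrantLock threadMonitor = new ReentrantLock();    // hypothetical: guards setState
        ReentrantLock listenerMonitor = new ReentrantLock();  // hypothetical: guards onChange
        CountDownLatch bothHolding = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);

        // Path 1: holds the thread monitor, then needs the listener monitor.
        Callable<Boolean> setStatePath =
            () -> holdThenTry(threadMonitor, listenerMonitor, bothHolding, bothTried);
        // Path 2: holds the listener monitor, then needs the thread monitor.
        Callable<Boolean> onChangePath =
            () -> holdThenTry(listenerMonitor, threadMonitor, bothHolding, bothTried);

        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<Boolean> f1 = pool.submit(setStatePath);
            Future<Boolean> f2 = pool.submit(onChangePath);
            return (f1.get() ? 1 : 0) + (f2.get() ? 1 : 0);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("second locks acquired: " + secondLocksAcquired());
    }
}
```

Neither worker can obtain its second lock while the other still holds it, so the count is 0. With plain `synchronized` methods there is no timeout, and the same interleaving would block both threads indefinitely.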
[jira] [Updated] (KAFKA-5571) Possible deadlock during shutdown in setState in kafka streams 10.2
[ https://issues.apache.org/jira/browse/KAFKA-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias J. Sax updated KAFKA-5571:
---
    Fix Version/s: 1.1.0

> Possible deadlock during shutdown in setState in kafka streams 10.2
> ---
>
>                 Key: KAFKA-5571
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5571
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 0.10.2.1
>            Reporter: Greg Fodor
>            Assignee: Eno Thereska
>            Priority: Major
>             Fix For: 1.1.0
>
>         Attachments: kafka-streams.deadlock.log