[jira] [Commented] (KAFKA-5152) Kafka Streams keeps restoring state after shutdown is initiated during startup
[ https://issues.apache.org/jira/browse/KAFKA-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193865#comment-16193865 ]

ASF GitHub Bot commented on KAFKA-5152:
---

Github user guozhangwang closed the pull request at:

    https://github.com/apache/kafka/pull/3607

> Kafka Streams keeps restoring state after shutdown is initiated during startup
> --
>
>                 Key: KAFKA-5152
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5152
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 0.10.2.1
>            Reporter: Xavier Léauté
>            Assignee: Damian Guy
>            Priority: Blocker
>             Fix For: 0.11.0.1, 1.0.0
>
>
> If streams shutdown is initiated during state restore (e.g. an uncaught
> exception is thrown), streams will not shut down until all stores have
> finished restoring.
> As restore progresses, stream threads appear to be taken out of service as
> part of the shutdown sequence, causing rebalancing of tasks. This compounds
> the problem by slowing down the restore process even further, since the
> remaining threads now also have to restore the reassigned tasks before they
> can shut down.
> A more severe issue arises if a new rebalance is triggered near the end of
> the waitingSync phase (e.g. because a new member joins the group, or some
> members time out waiting for the SyncGroup response). Some consumer clients
> of the group may already have proceeded to {{onPartitionsAssigned}} and be
> blocked trying to grab a file dir lock that other clients have not yet
> released. Meanwhile, the clients holding the lock keep re-sending
> {{JoinGroup}} requests, but the rebalance cannot complete: the clients
> blocked on the file dir lock are never kicked out of the group, because
> their heartbeat threads keep sending HBRequest. Hence this is a deadlock
> caused by not releasing the file dir locks in task suspension.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (KAFKA-5152) Kafka Streams keeps restoring state after shutdown is initiated during startup
[ https://issues.apache.org/jira/browse/KAFKA-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16137135#comment-16137135 ]

ASF GitHub Bot commented on KAFKA-5152:
---

Github user asfgit closed the pull request at:

    https://github.com/apache/kafka/pull/3675
[jira] [Commented] (KAFKA-5152) Kafka Streams keeps restoring state after shutdown is initiated during startup
[ https://issues.apache.org/jira/browse/KAFKA-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128797#comment-16128797 ]

ASF GitHub Bot commented on KAFKA-5152:
---

GitHub user dguy opened a pull request:

    https://github.com/apache/kafka/pull/3675

    KAFKA-5152: perform state restoration in poll loop

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dguy/kafka kafka-5152

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/3675.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3675

commit 4da235927f9eebf2442533b44bf017454d372747
Author: Damian Guy
Date:   2017-08-14T17:23:53Z

    perform state restoration in poll loop
[jira] [Commented] (KAFKA-5152) Kafka Streams keeps restoring state after shutdown is initiated during startup
[ https://issues.apache.org/jira/browse/KAFKA-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128595#comment-16128595 ]

ASF GitHub Bot commented on KAFKA-5152:
---

Github user dguy closed the pull request at:

    https://github.com/apache/kafka/pull/3653

> Kafka Streams keeps restoring state after shutdown is initiated during startup
> --
>
>                 Key: KAFKA-5152
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5152
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 0.10.2.1
>            Reporter: Xavier Léauté
>            Assignee: Damian Guy
>            Priority: Blocker
>             Fix For: 0.10.2.2, 0.11.0.1
[jira] [Commented] (KAFKA-5152) Kafka Streams keeps restoring state after shutdown is initiated during startup
[ https://issues.apache.org/jira/browse/KAFKA-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121510#comment-16121510 ]

ASF GitHub Bot commented on KAFKA-5152:
---

GitHub user dguy opened a pull request:

    https://github.com/apache/kafka/pull/3653

    KAFKA-5152: move state restoration out of rebalance and into poll loop

    In `onPartitionsAssigned`:
    1. release all locks for non-assigned suspended tasks.
    2. resume any suspended tasks.
    3. Create new tasks, but don't attempt to take the state lock.
    4. Pause partitions for any new tasks.
    5. set the state to `PARTITIONS_ASSIGNED`

    In `StreamThread#runLoop`
    1. poll
    2. if state is `PARTITIONS_ASSIGNED`
     2.1 attempt to initialize any new tasks, i.e., take out the state locks and init state stores
     2.2 restore some data for changelogs, i.e., poll once on the restore consumer and return the partitions that have been fully restored
     2.3 update tasks with restored partitions and move any that have completed restoration to running
     2.4 resume consumption for any tasks where all partitions have been restored.
     2.5 if all active tasks are running, transition to `RUNNING` and assign standby partitions to the restoreConsumer.
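The flow in the pull request description above can be sketched in a few lines of plain Java. This is only an illustrative model under stated assumptions, not the code from the pull request: `RestoreOnPollSketch`, `Task`, `runOnce`, `shutdown` and the `PENDING_SHUTDOWN` state are hypothetical names, restoration is modelled as a counter of remaining changelog records, and lock handling and the restore consumer are left out. The point it tries to show is that the rebalance callback does no restoration, so each loop iteration restores only a bounded batch and a shutdown request is seen before the next batch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified model of "restore in the poll loop"; not Kafka's real classes.
class RestoreOnPollSketch {
    enum ThreadState { PARTITIONS_ASSIGNED, RUNNING, PENDING_SHUTDOWN }

    /** A task with `remaining` changelog records still to restore (assumed model). */
    static final class Task {
        final String id;
        long remaining;
        boolean paused = true; // partitions of new tasks start paused

        Task(String id, long remaining) {
            this.id = id;
            this.remaining = remaining;
        }
    }

    ThreadState state = ThreadState.RUNNING;
    final List<Task> restoring = new ArrayList<>();
    final List<Task> running = new ArrayList<>();

    /** Rebalance callback: registers new tasks and flips the state; it never restores. */
    void onPartitionsAssigned(List<Task> newTasks) {
        restoring.addAll(newTasks);
        state = ThreadState.PARTITIONS_ASSIGNED;
    }

    /** Shutdown request; takes effect on the next loop iteration. */
    void shutdown() {
        state = ThreadState.PENDING_SHUTDOWN;
    }

    /** One loop iteration: restore at most `batch` records per task, then promote finished tasks. */
    void runOnce(long batch) {
        if (state != ThreadState.PARTITIONS_ASSIGNED) {
            return; // nothing to restore, or shutdown was requested
        }
        Iterator<Task> it = restoring.iterator();
        while (it.hasNext()) {
            Task task = it.next();
            task.remaining -= Math.min(batch, task.remaining);
            if (task.remaining == 0) {
                task.paused = false; // resume consumption for the restored task
                running.add(task);
                it.remove();
            }
        }
        if (restoring.isEmpty()) {
            state = ThreadState.RUNNING; // all active tasks are running
        }
    }
}
```

With two tasks needing 5 and 25 records and a batch size of 10, the first call to `runOnce` promotes only the first task, and the thread state becomes `RUNNING` after the third call. Calling `shutdown()` in between leaves the remaining records untouched.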
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dguy/kafka 0.11.0-restore-on-poll

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/3653.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3653

commit 27016b9e9706ee95bcedd9a1408c71e62a0f178e
Author: Damian Guy
Date:   2017-08-09T19:02:17Z

    restore state on the poll loop

> Kafka Streams keeps restoring state after shutdown is initiated during startup
> --
>
>                 Key: KAFKA-5152
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5152
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 0.10.2.1
>            Reporter: Xavier Léauté
>            Assignee: Damian Guy
>            Priority: Blocker
>             Fix For: 0.10.2.2, 0.11.0.1
[jira] [Commented] (KAFKA-5152) Kafka Streams keeps restoring state after shutdown is initiated during startup
[ https://issues.apache.org/jira/browse/KAFKA-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16110017#comment-16110017 ]

ASF GitHub Bot commented on KAFKA-5152:
---

GitHub user guozhangwang opened a pull request:

    https://github.com/apache/kafka/pull/3607

    [DO NOT MERGE] Existing StreamThread exception handling issues

    This is for @dguy as a reference while working on the first step of KAFKA-5152, as a list of existing issues that need to be addressed at the stream thread layer.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/guozhangwang/kafka KMinor-stream-thread-exception-handling

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/3607.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3607

commit 2d45430191c3dc417992c08454f9c550c1e6bb93
Author: Guozhang Wang
Date:   2017-07-25T22:43:18Z

    handle commit failed exception on stream thread

commit 9655791794dfa2623ba9f109676b112779fdceca
Author: Guozhang Wang
Date:   2017-07-26T00:39:07Z

    minor fixes

commit 26226d61529007acc0ccc151e6f6675fc9757d34
Author: Guozhang Wang
Date:   2017-07-26T01:01:57Z

    add a bunch of TODOs for exception handling

commit 3b054f556364be04d7f83a40b212e0c7facc4a23
Author: Guozhang Wang
Date:   2017-07-27T22:09:43Z

    rebase from trunk

commit 4b0f4f9cb30537fd0b45b192e2a5d81005ffa3c5
Author: Guozhang Wang
Date:   2017-07-27T22:26:46Z

    minor fixes

commit 5d2dffa72443139909d3e28f1684363a6e6f5585
Author: Guozhang Wang
Date:   2017-07-27T22:29:39Z

    github comments

commit 41ba5721ec9fe88b91416621a6236794d37a74de
Author: Guozhang Wang
Date:   2017-08-02T00:08:13Z

    rebase from trunk

> Kafka Streams keeps restoring state after shutdown is initiated during startup
> --
>
>                 Key: KAFKA-5152
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5152
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 0.10.2.1
>            Reporter: Xavier Léauté
>            Assignee: Matthias J. Sax
>            Priority: Blocker
>             Fix For: 0.10.2.2, 0.11.0.1
[jira] [Commented] (KAFKA-5152) Kafka Streams keeps restoring state after shutdown is initiated during startup
[ https://issues.apache.org/jira/browse/KAFKA-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092330#comment-16092330 ]

Guozhang Wang commented on KAFKA-5152:
--

Here are a couple of alternative solutions to resolve this issue:

1. Revert the task-suspension optimization. This is the worst-case solution we can fall back on to work around the issue.

2. Move the restoration process completely out of the {{onPartitionsAssigned}} callback and into the thread's main while loop. More specifically, the workflow of the thread will be:

a) Only create the active / standby tasks without executing the restoration of states on the active tasks; release the dir file locks in {{onPartitionsAssigned}} for those suspended-and-not-reassigned tasks. Mark all the created active tasks whose state has not been restored up to date as not-ready first, and pause all these tasks' corresponding source topic-partitions. (For suspended-and-reassigned tasks, double check that they are not included, since their state should be up to date.)

b) In the main loop:

b.1) Check all the not-ready tasks and see if their stores have completed restoration (restored offset = log end offset); if yes, mark these tasks as active now and resume their corresponding source topic-partitions.

b.2) Process / punctuate / commit if possible for active tasks which already have some records fetched.

b.3) Restore the state for both standby tasks and active-not-ready tasks.

One thing that needs care, though: tasks within the same thread can now start processing at different times, but for IQ we still need to mark the instance / thread as RUNNING only after all its tasks are running.
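Steps b.1 and b.3 of the workflow above, together with the IQ condition, can be illustrated with a small self-contained Java model. This is a rough sketch under assumptions, not Kafka Streams code: `NotReadyTaskSketch`, `ChangelogProgress`, `addNotReady`, `restoreSome`, `promoteRestoredTasks` and `isThreadRunning` are hypothetical names, tasks are identified by plain strings, and pausing a source topic-partition is modelled as membership in a set. Completion is tested with `>=` rather than strict equality purely as a defensive choice in this sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Hypothetical model of not-ready task bookkeeping; names are illustrative only.
class NotReadyTaskSketch {
    /** Restore progress of one task's changelog (assumed model). */
    static final class ChangelogProgress {
        long restoredOffset;
        final long logEndOffset;

        ChangelogProgress(long restoredOffset, long logEndOffset) {
            this.restoredOffset = restoredOffset;
            this.logEndOffset = logEndOffset;
        }

        /** The source compares for equality; >= is used here as a defensive assumption. */
        boolean isComplete() {
            return restoredOffset >= logEndOffset;
        }
    }

    final Map<String, ChangelogProgress> notReady = new HashMap<>();
    final Set<String> active = new HashSet<>();
    final Set<String> pausedSources = new HashSet<>();

    /** Step a): register a created task as not-ready and pause its source partitions. */
    void addNotReady(String taskId, long restoredOffset, long logEndOffset) {
        notReady.put(taskId, new ChangelogProgress(restoredOffset, logEndOffset));
        pausedSources.add(taskId);
    }

    /** Step b.1): promote every not-ready task whose restoration has completed. */
    void promoteRestoredTasks() {
        Iterator<Map.Entry<String, ChangelogProgress>> it = notReady.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, ChangelogProgress> entry = it.next();
            if (entry.getValue().isComplete()) {
                active.add(entry.getKey());
                pausedSources.remove(entry.getKey()); // resume the source partitions
                it.remove();
            }
        }
    }

    /** Step b.3): restore up to `maxRecords` for each not-ready task, capped at the log end. */
    void restoreSome(long maxRecords) {
        for (ChangelogProgress progress : notReady.values()) {
            progress.restoredOffset =
                    Math.min(progress.logEndOffset, progress.restoredOffset + maxRecords);
        }
    }

    /** IQ condition: the thread counts as RUNNING only once no task is left not-ready. */
    boolean isThreadRunning() {
        return notReady.isEmpty();
    }
}
```

In this model a task registered with its restored offset already at the log end is promoted on the first `promoteRestoredTasks()` call, while the thread as a whole reports running only after every other task has caught up as well.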
> Kafka Streams keeps restoring state after shutdown is initiated during startup
> --
>
>                 Key: KAFKA-5152
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5152
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 0.10.2.1
>            Reporter: Xavier Léauté
>            Assignee: Matthias J. Sax
>             Fix For: 0.10.2.2, 0.11.0.1