[jira] [Commented] (KAFKA-5152) Kafka Streams keeps restoring state after shutdown is initiated during startup

2017-10-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193865#comment-16193865
 ] 

ASF GitHub Bot commented on KAFKA-5152:
---

Github user guozhangwang closed the pull request at:

https://github.com/apache/kafka/pull/3607


> Kafka Streams keeps restoring state after shutdown is initiated during startup
> --
>
> Key: KAFKA-5152
> URL: https://issues.apache.org/jira/browse/KAFKA-5152
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.2.1
>Reporter: Xavier Léauté
>Assignee: Damian Guy
>Priority: Blocker
> Fix For: 0.11.0.1, 1.0.0
>
>
> If streams shutdown is initiated during state restore (e.g. an uncaught 
> exception is thrown) streams will not shut down until all stores are first 
> finished restoring.
> As restore progresses, stream threads appear to be taken out of service as 
> part of the shutdown sequence, causing rebalancing of tasks. This compounds 
> the problem by slowing down the restore process even further, since the 
> remaining threads now have to also restore the reassigned tasks before they 
> can shut down.
> A more severe issue is that if there is a new rebalance triggered during the 
> end of the waitingSync phase (e.g. due to a new member joining the group, or 
> some members timed out the SyncGroup response), then some consumer clients of 
> the group may already proceed with the {{onPartitionsAssigned}} and blocked 
> on trying to grab the file dir lock not yet released from other clients, 
> while the other clients holding the lock are consistently re-sending 
> {{JoinGroup}} requests while the rebalance cannot be completed because the 
> clients blocked on the file dir lock will not be kicked out of the group as 
> its heartbeat thread has been consistently sending HBRequest. Hence this is a 
> deadlock caused by not releasing the file dir locks in task suspension.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5152) Kafka Streams keeps restoring state after shutdown is initiated during startup

2017-08-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16137135#comment-16137135
 ] 

ASF GitHub Bot commented on KAFKA-5152:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/3675


> Kafka Streams keeps restoring state after shutdown is initiated during startup
> --
>
> Key: KAFKA-5152
> URL: https://issues.apache.org/jira/browse/KAFKA-5152
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.2.1
>Reporter: Xavier Léauté
>Assignee: Damian Guy
>Priority: Blocker
> Fix For: 0.11.0.1, 1.0.0
>
>
> If streams shutdown is initiated during state restore (e.g. an uncaught 
> exception is thrown) streams will not shut down until all stores are first 
> finished restoring.
> As restore progresses, stream threads appear to be taken out of service as 
> part of the shutdown sequence, causing rebalancing of tasks. This compounds 
> the problem by slowing down the restore process even further, since the 
> remaining threads now have to also restore the reassigned tasks before they 
> can shut down.
> A more severe issue is that if there is a new rebalance triggered during the 
> end of the waitingSync phase (e.g. due to a new member joining the group, or 
> some members timed out the SyncGroup response), then some consumer clients of 
> the group may already proceed with the {{onPartitionsAssigned}} and blocked 
> on trying to grab the file dir lock not yet released from other clients, 
> while the other clients holding the lock are consistently re-sending 
> {{JoinGroup}} requests while the rebalance cannot be completed because the 
> clients blocked on the file dir lock will not be kicked out of the group as 
> its heartbeat thread has been consistently sending HBRequest. Hence this is a 
> deadlock caused by not releasing the file dir locks in task suspension.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5152) Kafka Streams keeps restoring state after shutdown is initiated during startup

2017-08-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128797#comment-16128797
 ] 

ASF GitHub Bot commented on KAFKA-5152:
---

GitHub user dguy opened a pull request:

https://github.com/apache/kafka/pull/3675

KAFKA-5152: perform state restoration in poll loop



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dguy/kafka kafka-5152

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/3675.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3675


commit 4da235927f9eebf2442533b44bf017454d372747
Author: Damian Guy 
Date:   2017-08-14T17:23:53Z

perform state restoration in poll loop




> Kafka Streams keeps restoring state after shutdown is initiated during startup
> --
>
> Key: KAFKA-5152
> URL: https://issues.apache.org/jira/browse/KAFKA-5152
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.2.1
>Reporter: Xavier Léauté
>Assignee: Damian Guy
>Priority: Blocker
> Fix For: 0.11.0.1, 1.0.0
>
>
> If streams shutdown is initiated during state restore (e.g. an uncaught 
> exception is thrown) streams will not shut down until all stores are first 
> finished restoring.
> As restore progresses, stream threads appear to be taken out of service as 
> part of the shutdown sequence, causing rebalancing of tasks. This compounds 
> the problem by slowing down the restore process even further, since the 
> remaining threads now have to also restore the reassigned tasks before they 
> can shut down.
> A more severe issue is that if there is a new rebalance triggered during the 
> end of the waitingSync phase (e.g. due to a new member joining the group, or 
> some members timed out the SyncGroup response), then some consumer clients of 
> the group may already proceed with the {{onPartitionsAssigned}} and blocked 
> on trying to grab the file dir lock not yet released from other clients, 
> while the other clients holding the lock are consistently re-sending 
> {{JoinGroup}} requests while the rebalance cannot be completed because the 
> clients blocked on the file dir lock will not be kicked out of the group as 
> its heartbeat thread has been consistently sending HBRequest. Hence this is a 
> deadlock caused by not releasing the file dir locks in task suspension.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5152) Kafka Streams keeps restoring state after shutdown is initiated during startup

2017-08-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128595#comment-16128595
 ] 

ASF GitHub Bot commented on KAFKA-5152:
---

Github user dguy closed the pull request at:

https://github.com/apache/kafka/pull/3653


> Kafka Streams keeps restoring state after shutdown is initiated during startup
> --
>
> Key: KAFKA-5152
> URL: https://issues.apache.org/jira/browse/KAFKA-5152
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.2.1
>Reporter: Xavier Léauté
>Assignee: Damian Guy
>Priority: Blocker
> Fix For: 0.10.2.2, 0.11.0.1
>
>
> If streams shutdown is initiated during state restore (e.g. an uncaught 
> exception is thrown) streams will not shut down until all stores are first 
> finished restoring.
> As restore progresses, stream threads appear to be taken out of service as 
> part of the shutdown sequence, causing rebalancing of tasks. This compounds 
> the problem by slowing down the restore process even further, since the 
> remaining threads now have to also restore the reassigned tasks before they 
> can shut down.
> A more severe issue is that if there is a new rebalance triggered during the 
> end of the waitingSync phase (e.g. due to a new member joining the group, or 
> some members timed out the SyncGroup response), then some consumer clients of 
> the group may already proceed with the {{onPartitionsAssigned}} and blocked 
> on trying to grab the file dir lock not yet released from other clients, 
> while the other clients holding the lock are consistently re-sending 
> {{JoinGroup}} requests while the rebalance cannot be completed because the 
> clients blocked on the file dir lock will not be kicked out of the group as 
> its heartbeat thread has been consistently sending HBRequest. Hence this is a 
> deadlock caused by not releasing the file dir locks in task suspension.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5152) Kafka Streams keeps restoring state after shutdown is initiated during startup

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121510#comment-16121510
 ] 

ASF GitHub Bot commented on KAFKA-5152:
---

GitHub user dguy opened a pull request:

https://github.com/apache/kafka/pull/3653

KAFKA-5152: move state restoration out of rebalance and into poll loop

In `onPartitionsAssigned`: 
1. release all locks for non-assigned suspended tasks.
2. resume any suspended tasks.
3. Create new tasks, but don't attempt to take the state lock.
4. Pause partitions for any new tasks.
5. set the state to `PARTITIONS_ASSIGNED`

In `StreamThread#runLoop`
1. poll
2. if state is `PARTITIONS_ASSIGNED`
 2.1  attempt to initialize any new tasks, i.e, take out the state locks 
and init state stores
 2.2 restore some data for changelogs, i.e., poll once on the restore 
consumer and return the partitions that have been fully restored
 2.3 update tasks with restored partitions and move any that have completed 
restoration to running
 2.4 resume consumption for any tasks where all partitions have been 
restored.
 2.5 if all active tasks are running, transition to `RUNNING` and assign 
standby partitions to the restoreConsumer.
 


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dguy/kafka 0.11.0-restore-on-poll

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/3653.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3653


commit 27016b9e9706ee95bcedd9a1408c71e62a0f178e
Author: Damian Guy 
Date:   2017-08-09T19:02:17Z

restore state on the poll loop




> Kafka Streams keeps restoring state after shutdown is initiated during startup
> --
>
> Key: KAFKA-5152
> URL: https://issues.apache.org/jira/browse/KAFKA-5152
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.2.1
>Reporter: Xavier Léauté
>Assignee: Damian Guy
>Priority: Blocker
> Fix For: 0.10.2.2, 0.11.0.1
>
>
> If streams shutdown is initiated during state restore (e.g. an uncaught 
> exception is thrown) streams will not shut down until all stores are first 
> finished restoring.
> As restore progresses, stream threads appear to be taken out of service as 
> part of the shutdown sequence, causing rebalancing of tasks. This compounds 
> the problem by slowing down the restore process even further, since the 
> remaining threads now have to also restore the reassigned tasks before they 
> can shut down.
> A more severe issue is that if there is a new rebalance triggered during the 
> end of the waitingSync phase (e.g. due to a new member joining the group, or 
> some members timed out the SyncGroup response), then some consumer clients of 
> the group may already proceed with the {{onPartitionsAssigned}} and blocked 
> on trying to grab the file dir lock not yet released from other clients, 
> while the other clients holding the lock are consistently re-sending 
> {{JoinGroup}} requests while the rebalance cannot be completed because the 
> clients blocked on the file dir lock will not be kicked out of the group as 
> its heartbeat thread has been consistently sending HBRequest. Hence this is a 
> deadlock caused by not releasing the file dir locks in task suspension.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5152) Kafka Streams keeps restoring state after shutdown is initiated during startup

2017-08-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16110017#comment-16110017
 ] 

ASF GitHub Bot commented on KAFKA-5152:
---

GitHub user guozhangwang opened a pull request:

https://github.com/apache/kafka/pull/3607

[DO NOT MERGE] Existing StreamThread exception handling issues

This is for @dguy as a reference while working on the first step of 
KAFKA-5152, as a list of existing issues that need to be address at stream 
thread layer.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/guozhangwang/kafka 
KMinor-stream-thread-exception-handling

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/3607.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3607


commit 2d45430191c3dc417992c08454f9c550c1e6bb93
Author: Guozhang Wang 
Date:   2017-07-25T22:43:18Z

handle commit failed exception on stream thread

commit 9655791794dfa2623ba9f109676b112779fdceca
Author: Guozhang Wang 
Date:   2017-07-26T00:39:07Z

minor fixes

commit 26226d61529007acc0ccc151e6f6675fc9757d34
Author: Guozhang Wang 
Date:   2017-07-26T01:01:57Z

add a bunch of TODOs for exception handling

commit 3b054f556364be04d7f83a40b212e0c7facc4a23
Author: Guozhang Wang 
Date:   2017-07-27T22:09:43Z

rebase from trunk

commit 4b0f4f9cb30537fd0b45b192e2a5d81005ffa3c5
Author: Guozhang Wang 
Date:   2017-07-27T22:26:46Z

minor fixes

commit 5d2dffa72443139909d3e28f1684363a6e6f5585
Author: Guozhang Wang 
Date:   2017-07-27T22:29:39Z

github comments

commit 41ba5721ec9fe88b91416621a6236794d37a74de
Author: Guozhang Wang 
Date:   2017-08-02T00:08:13Z

rebase from trunk




> Kafka Streams keeps restoring state after shutdown is initiated during startup
> --
>
> Key: KAFKA-5152
> URL: https://issues.apache.org/jira/browse/KAFKA-5152
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.2.1
>Reporter: Xavier Léauté
>Assignee: Matthias J. Sax
>Priority: Blocker
> Fix For: 0.10.2.2, 0.11.0.1
>
>
> If streams shutdown is initiated during state restore (e.g. an uncaught 
> exception is thrown) streams will not shut down until all stores are first 
> finished restoring.
> As restore progresses, stream threads appear to be taken out of service as 
> part of the shutdown sequence, causing rebalancing of tasks. This compounds 
> the problem by slowing down the restore process even further, since the 
> remaining threads now have to also restore the reassigned tasks before they 
> can shut down.
> A more severe issue is that if there is a new rebalance triggered during the 
> end of the waitingSync phase (e.g. due to a new member joining the group, or 
> some members timed out the SyncGroup response), then some consumer clients of 
> the group may already proceed with the {{onPartitionsAssigned}} and blocked 
> on trying to grab the file dir lock not yet released from other clients, 
> while the other clients holding the lock are consistently re-sending 
> {{JoinGroup}} requests while the rebalance cannot be completed because the 
> clients blocked on the file dir lock will not be kicked out of the group as 
> its heartbeat thread has been consistently sending HBRequest. Hence this is a 
> deadlock caused by not releasing the file dir locks in task suspension.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5152) Kafka Streams keeps restoring state after shutdown is initiated during startup

2017-07-18 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092330#comment-16092330
 ] 

Guozhang Wang commented on KAFKA-5152:
--

Here are a couple alternative solutions to resolve this issue:

1. Revert the task-suspension optimization, this is the worst-case solution we 
can get to work around it.

2. Move the restoration process completely out of the {{onPartitionsAssigned}} 
callback, in the thread's main while loop. More specifically the workflow of 
the thread will be:

a) only create the active / standby tasks without executing the restoration of 
states on the active tasks; release the dir file locks in 
{{onPartitionAssigned}} for those suspended-and-not-reassigned tasks. Mark all 
the created active tasks as not-ready first and pause all these task's 
corresponding source topic-partitions.

b) whose state has not been restored up to date (for those 
suspended-and-reassigned tasks double check that the they should not be 
included since their state should be up-to-date)

b) in the main loop:

b.1) check for all the not-ready tasks and see if their stores have completed 
restoration (restored offset = logend offset), if yes mark these tasks as 
active now and resume their corresponding source topic-partitions.

b.2) process / punctuate / commit if possible for active tasks which have 
already some records fetched.

b.3) restore the state for both standby tasks and active-not-ready tasks.

One thing needed for care though, is that since now tasks within the same 
thread can start processing at different time, but for IQ we still need to only 
make the instance / thread as RUNNING after all its tasks are now running.

> Kafka Streams keeps restoring state after shutdown is initiated during startup
> --
>
> Key: KAFKA-5152
> URL: https://issues.apache.org/jira/browse/KAFKA-5152
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 0.10.2.1
>Reporter: Xavier Léauté
>Assignee: Matthias J. Sax
> Fix For: 0.10.2.2, 0.11.0.1
>
>
> If streams shutdown is initiated during state restore (e.g. an uncaught 
> exception is thrown) streams will not shut down until all stores are first 
> finished restoring.
> As restore progresses, stream threads appear to be taken out of service as 
> part of the shutdown sequence, causing rebalancing of tasks. This compounds 
> the problem by slowing down the restore process even further, since the 
> remaining threads now have to also restore the reassigned tasks before they 
> can shut down.
> A more severe issue is that if there is a new rebalance triggered during the 
> end of the waitingSync phase (e.g. due to a new member joining the group, or 
> some members timed out the SyncGroup response), then some consumer clients of 
> the group may already proceed with the {{onPartitionsAssigned}} and blocked 
> on trying to grab the file dir lock not yet released from other clients, 
> while the other clients holding the lock are consistently re-sending 
> {{JoinGroup}} requests while the rebalance cannot be completed because the 
> clients blocked on the file dir lock will not be kicked out of the group as 
> its heartbeat thread has been consistently sending HBRequest. Hence this is a 
> deadlock caused by not releasing the file dir locks in task suspension.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)