[ https://issues.apache.org/jira/browse/KAFKA-7203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthias J. Sax resolved KAFKA-7203.
------------------------------------
    Resolution: Fixed

> Improve Streams StickyTaskAssignor
> ----------------------------------
>
>                 Key: KAFKA-7203
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7203
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Guozhang Wang
>            Priority: Major
>
> This is a discussion inspired by the work on fixing KAFKA-7144.
> Currently we are not striking a very good trade-off between stickiness and
> workload balance: we honor the former more than the latter. One idea to
> improve on this is the following:
> {code}
> I'd like to propose a slightly different approach to fix 7144 while making
> no-worse trade-offs between stickiness and sub-topology balance. The key idea
> is to adjust the assignment so that each client's task distribution gets as
> close as possible to the sub-topologies' num.tasks distribution.
> Here is a detailed workflow:
> 1. At the beginning, we first calculate for each client C how many tasks it
> should ideally be assigned: num.total_tasks / num.total_capacity * C_capacity,
> rounded down; call it C_a. Note that since we round this number down, the sum
> of C_a across all clients may be <= num.total_tasks, but this does not matter.
> 2. Then for each client C, based on its number of previously assigned tasks
> C_p, we calculate how many tasks it should take over or give up, as C_a - C_p
> (if positive, it should take over some; otherwise it should give up some).
> Because of the rounding down, when we calculate C_a - C_p for each client we
> need to make sure that the total number of give-ups equals the total number
> of take-overs; some ad-hoc heuristics can be used.
> 3. Then we calculate the task distribution across the sub-topologies as a
> whole. For example, if we have three sub-topologies st0, st1, and st2, where
> st0 has 4 total tasks, st1 has 4 total tasks, and st2 has 8 total tasks, then
> the distribution between st0, st1, and st2 should be 1:1:2. Let's call this
> the global distribution, and note that since num.tasks per sub-topology never
> changes, this distribution should NEVER change.
> 4. Then for each client that should give up some tasks, we decide which tasks
> it should give up so that the remaining task distribution is as close as
> possible to the global distribution above.
> For example, if a client previously owned 4 tasks of st0, no tasks of st1,
> and 2 tasks of st2, and now needs to give up 3 tasks, it should give up 3
> tasks of st0, so that the remaining distribution 1:0:2 is as close as
> possible to 1:1:2 (it owns no st1 tasks to rebalance with).
> 5. Now that we've collected the list of given-up tasks, plus any tasks that
> have no previous active assignment (in normal operation this should not
> happen, since all tasks should have been created on day one), we migrate them
> to the clients that need to take over some, again proportionally to the
> global distribution.
> For example, if a client previously owned 1 task of st0 and nothing of st1 or
> st2, and now needs to take over 3 tasks, we would try to give it 1 task of
> st1 and 2 tasks of st2, so that the resulting distribution becomes 1:1:2. And
> we ONLY consider prev-standby tasks when deciding which specific st1 / st2
> tasks that client should get.
> Now, consider the following scenarios:
> a) This is a clean start and there is no prev-assignment at all: step 4 is a
> no-op, and the result should still be fine.
> b) A client leaves the group: no client needs to give up anything and all
> clients may need to take over some, so step 4 is a no-op, and the accumulated
> step-5 list contains only the departed client's tasks.
> c) A new client joins the group: all existing clients need to give up some,
> and only the new client needs to take over the given-up tasks. Hence step 5
> is straightforward.
> {code}
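A minimal sketch of steps 1-2 of the proposal quoted above, assuming simplified stand-in shapes (clients keyed by String, capacities and previous task counts as plain Maps); these names are illustrative and are not the actual StickyTaskAssignor types:

{code}
import java.util.LinkedHashMap;
import java.util.Map;

// Self-contained sketch of steps 1-2; "capacity" and "prevAssigned" are
// hypothetical stand-ins, not fields of the real StickyTaskAssignor.
public class TargetCounts {

    /**
     * For each client, returns C_a - C_p: positive means take over that many
     * tasks, negative means give up that many.
     */
    static Map<String, Integer> takeOverOrGiveUp(Map<String, Integer> capacity,
                                                 Map<String, Integer> prevAssigned,
                                                 int totalTasks) {
        int totalCapacity = capacity.values().stream().mapToInt(Integer::intValue).sum();

        // Step 1: C_a = floor(totalTasks / totalCapacity * C_capacity).
        Map<String, Integer> target = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : capacity.entrySet()) {
            target.put(e.getKey(), (int) ((long) totalTasks * e.getValue() / totalCapacity));
        }

        // The floors can leave sum(C_a) < totalTasks; hand the remainder out
        // one task at a time (one of the "ad-hoc heuristics" step 2 mentions),
        // so that give-ups and take-overs balance overall.
        int remainder = totalTasks - target.values().stream().mapToInt(Integer::intValue).sum();
        for (String c : target.keySet()) {
            if (remainder == 0) break;
            target.put(c, target.get(c) + 1);
            remainder--;
        }

        // Step 2: diff the target against the previous assignment. If some
        // tasks had no previous owner (e.g. a client left, scenario b), the
        // take-over surplus absorbs them.
        Map<String, Integer> diff = new LinkedHashMap<>();
        for (String c : target.keySet()) {
            diff.put(c, target.get(c) - prevAssigned.getOrDefault(c, 0));
        }
        return diff;
    }
}
{code}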
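Steps 4-5 both reduce to "pick tasks so that the per-client counts track the global distribution". Below is a sketch of one possible greedy heuristic for the give-up side, which repeatedly sheds a task from the sub-topology whose current share most exceeds its global share; this is an assumption about how the selection could work, not necessarily what the merged patch does:

{code}
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the proportional selection in step 4; the Map-based shapes and
// the greedy rule are illustrative assumptions.
public class ProportionalGiveUp {

    /** Returns how many tasks of each sub-topology the client should give up. */
    static Map<String, Integer> tasksToGiveUp(Map<String, Integer> owned,
                                              Map<String, Integer> globalCounts,
                                              int giveUps) {
        Map<String, Integer> remaining = new LinkedHashMap<>(owned);
        Map<String, Integer> giveUp = new LinkedHashMap<>();
        int globalTotal = globalCounts.values().stream().mapToInt(Integer::intValue).sum();

        for (int i = 0; i < giveUps; i++) {
            int remainingTotal = remaining.values().stream().mapToInt(Integer::intValue).sum();
            String worst = null;
            double worstExcess = Double.NEGATIVE_INFINITY;
            for (String st : remaining.keySet()) {
                if (remaining.get(st) == 0) continue;  // nothing left to shed here
                // How far this sub-topology's share exceeds its global share.
                double excess = (double) remaining.get(st) / remainingTotal
                              - (double) globalCounts.get(st) / globalTotal;
                if (excess > worstExcess) {
                    worstExcess = excess;
                    worst = st;
                }
            }
            remaining.put(worst, remaining.get(worst) - 1);
            giveUp.merge(worst, 1, Integer::sum);
        }
        return giveUp;
    }

    public static void main(String[] args) {
        // Step-4 example from the proposal: global distribution st0:st1:st2 =
        // 1:1:2, and the client owns 4 of st0, 0 of st1, 2 of st2.
        Map<String, Integer> owned = new LinkedHashMap<>();
        owned.put("st0", 4); owned.put("st1", 0); owned.put("st2", 2);
        Map<String, Integer> global = new LinkedHashMap<>();
        global.put("st0", 4); global.put("st1", 4); global.put("st2", 8);

        System.out.println(tasksToGiveUp(owned, global, 3));
        // Prints {st0=3}: shedding all three from st0 leaves 1:0:2, the closest
        // the client can get to 1:1:2 given that it owns no st1 tasks.
    }
}
{code}

Choosing one task at a time keeps the heuristic simple and naturally handles clients that own none of some sub-topology, as in the step-4 example above; a real implementation would additionally prefer previous standby tasks on the take-over side, per step 5.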