Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility
Thanks for all the very helpful discussions, I'm closing the vote with a tally here: +1: 7 (Nick, John, Walker, Bruno, Lucas, Matthias, Guozhang), with 5 binding votes and 2 non-binding votes. -1: 0 Guozhang On Wed, Jan 25, 2023 at 5:48 PM Matthias J. Sax wrote: > > Thanks! > > +1 (binding) > > -Matthias > > On 1/24/23 1:17 PM, Guozhang Wang wrote: > > Hi Matthias: > > > > re "paused" -> "suspended": I got your point now, thanks. Just to > > clarify the two functions are a bit different: "paused" tasks are > > because of the topology being paused, i.e. from KIP-834; whereas > > "suspended" tasks are when a restoring tasks are being removed before > > it completes due to a follow-up rebalance, and this is to distinguish > > with "onRestoreEnd", as described in KAFKA-10575. A suspended task is > > no longer owned by the thread and hence there's no need to measure the > > number of such tasks. > > > > re: "restore-ratio": that's a good point. I like it to function in the > > same way as the "records-rate" metrics. Will update the wiki. > > > > re: making "restore-remaining-records-total" at INFO level: sounds > > good to me too. I will also update the metric name a bit to be more > > specific. > > > > > > > > On Thu, Jan 19, 2023 at 2:35 PM Guozhang Wang > > wrote: > >> > >> Hello Matthias, > >> > >> Thanks for the feedback. I was on vacation for a while. Pardon for the > >> late replies. Please see them inline below > >> > >> On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax wrote: > >>> > >>> Seems I am late to the party... Great KIP. Couple of questions from my > >>> side: > >>> > >>> (1) What is the purpose of `standby-updating-tasks`? It seems to be the > >>> same as the number of assigned standby task? Not sure how useful it > >>> would be? > >>> > >> In general, yes, it is the number of assigned standby tasks --- there > >> will be transit times when the assigned standby tasks are not yet > >> being updated but it would not last long --- but we do not yet have a > >> direct gauge to expose this before, and users have to infer this from > >> other indirect metrics. > >> > >>> > >>> > >>> (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused" > >>> exactly mean? There was a discussion about renaming the callback method > >>> from pause to suspended. So should this be called `suspended`, too? And > >>> if yes, how is it useful for users? > >>> > >> Pausing here refers to "KIP-834: Pause / Resume KafkaStreams > >> Topologies" > >> (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832). > >> When a topology is paused, all its tasks including standbys will be > >> paused too. > >> > >> I'm not aware of a discussion to rename the call name to "suspend" for > >> KIP-834. Could you point me to the reference? > >> > >>> > >>> > >>> (3) `restore-ratio`: the description says > >>> > The fraction of time the thread spent on restoring active or standby > tasks > >>> > >>> I find the term "restoring" does only apply to active tasks, but not to > >>> standbys. Can we reword this? > >>> > >> Yeah I have been discussing this with others in the community a bit as > >> well, but so far I have not been convinced of a better name than it. > >> Some other alternatives being discussed but not win everyone's love is > >> "restore-or-update-ratio", "process-ratio" (for the restore thread > >> that means restoring or updating), and "io-ratio". > >> > >> The only one so far that I feel is probably better, is > >> "state-update-ratio". If folks feel this one is better than > >> "restore-ratio" I'm happy to update. > >> > >>> > >>> (4) `restore-call-rate`: not sure what you exactly mean by "restore > >>> calls"? > >>> > >> This is similar to the "io-calls-rate" in the selector classes, i.e. > >> the number of "restore" function calls made. It's argurably a very > >> low-level metrics but I included it since it could be useful in some > >> debugging scenarios. > >> > >>> > >>> (5) `restore-remaining-records-total` -- why is this a task metric? > >>> Seems we could roll it up into a thread metric that we report at INFO > >>> level (we could still have per-task DEBUG level metric for it in > >>> addition). > >>> > >> The rationale behind it is the general principle in metrics design > >> that "Kafka would provide the lowest necessary metrics levels, and > >> users can do the roll-ups however they want". > >> > >>> > >>> (6) What about "warmup tasks"? Internally, we treat them as standbys, > >>> but it seems it's hard for users to reason about it in the scale-out > >>> warm-up case. Would it be helpful (and possible) to report "warmup > >>> progress" explicitly? > >>> > >> At the restore thread level, we cannot differentiate standby tasks > >> from warmup tasks since the latter is created exactly just like the > >> former. But I do agree this is an issue for visibility that worth > >> addressing, I think another KIP would be needed to first consider > >> distinguishing these
Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility
Thanks! +1 (binding) -Matthias On 1/24/23 1:17 PM, Guozhang Wang wrote: Hi Matthias: re "paused" -> "suspended": I got your point now, thanks. Just to clarify the two functions are a bit different: "paused" tasks are because of the topology being paused, i.e. from KIP-834; whereas "suspended" tasks are when a restoring tasks are being removed before it completes due to a follow-up rebalance, and this is to distinguish with "onRestoreEnd", as described in KAFKA-10575. A suspended task is no longer owned by the thread and hence there's no need to measure the number of such tasks. re: "restore-ratio": that's a good point. I like it to function in the same way as the "records-rate" metrics. Will update the wiki. re: making "restore-remaining-records-total" at INFO level: sounds good to me too. I will also update the metric name a bit to be more specific. On Thu, Jan 19, 2023 at 2:35 PM Guozhang Wang wrote: Hello Matthias, Thanks for the feedback. I was on vacation for a while. Pardon for the late replies. Please see them inline below On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax wrote: Seems I am late to the party... Great KIP. Couple of questions from my side: (1) What is the purpose of `standby-updating-tasks`? It seems to be the same as the number of assigned standby task? Not sure how useful it would be? In general, yes, it is the number of assigned standby tasks --- there will be transit times when the assigned standby tasks are not yet being updated but it would not last long --- but we do not yet have a direct gauge to expose this before, and users have to infer this from other indirect metrics. (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused" exactly mean? There was a discussion about renaming the callback method from pause to suspended. So should this be called `suspended`, too? And if yes, how is it useful for users? Pausing here refers to "KIP-834: Pause / Resume KafkaStreams Topologies" (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832). When a topology is paused, all its tasks including standbys will be paused too. I'm not aware of a discussion to rename the call name to "suspend" for KIP-834. Could you point me to the reference? (3) `restore-ratio`: the description says The fraction of time the thread spent on restoring active or standby tasks I find the term "restoring" does only apply to active tasks, but not to standbys. Can we reword this? Yeah I have been discussing this with others in the community a bit as well, but so far I have not been convinced of a better name than it. Some other alternatives being discussed but not win everyone's love is "restore-or-update-ratio", "process-ratio" (for the restore thread that means restoring or updating), and "io-ratio". The only one so far that I feel is probably better, is "state-update-ratio". If folks feel this one is better than "restore-ratio" I'm happy to update. (4) `restore-call-rate`: not sure what you exactly mean by "restore calls"? This is similar to the "io-calls-rate" in the selector classes, i.e. the number of "restore" function calls made. It's argurably a very low-level metrics but I included it since it could be useful in some debugging scenarios. (5) `restore-remaining-records-total` -- why is this a task metric? Seems we could roll it up into a thread metric that we report at INFO level (we could still have per-task DEBUG level metric for it in addition). The rationale behind it is the general principle in metrics design that "Kafka would provide the lowest necessary metrics levels, and users can do the roll-ups however they want". (6) What about "warmup tasks"? Internally, we treat them as standbys, but it seems it's hard for users to reason about it in the scale-out warm-up case. Would it be helpful (and possible) to report "warmup progress" explicitly? At the restore thread level, we cannot differentiate standby tasks from warmup tasks since the latter is created exactly just like the former. But I do agree this is an issue for visibility that worth addressing, I think another KIP would be needed to first consider distinguishing these two at the class level. -Matthias On 11/1/22 2:44 AM, Lucas Brutschy wrote: We need this! + 1 non binding Cheers, Lucas On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna wrote: Guozhang, Thanks for the KIP! +1 (binding) Best, Bruno On 25.10.22 22:07, Walker Carlson wrote: +1 non binding Thanks for the kip! On Thu, Oct 20, 2022 at 10:25 PM John Roesler wrote: Thanks for the KIP, Guozhang! I'm +1 (binding) -John On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote: Can't wait! +1 (non-binding) On Wed, 12 Oct 2022, 18:02 Guozhang Wang, wrote: Hello all, I'd like to start a vote for the following KIP, aiming to improve Kafka Stream's restoration visibility via new metrics and callback methods:
Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility
Hi Matthias: re "paused" -> "suspended": I got your point now, thanks. Just to clarify the two functions are a bit different: "paused" tasks are because of the topology being paused, i.e. from KIP-834; whereas "suspended" tasks are when a restoring tasks are being removed before it completes due to a follow-up rebalance, and this is to distinguish with "onRestoreEnd", as described in KAFKA-10575. A suspended task is no longer owned by the thread and hence there's no need to measure the number of such tasks. re: "restore-ratio": that's a good point. I like it to function in the same way as the "records-rate" metrics. Will update the wiki. re: making "restore-remaining-records-total" at INFO level: sounds good to me too. I will also update the metric name a bit to be more specific. On Thu, Jan 19, 2023 at 2:35 PM Guozhang Wang wrote: > > Hello Matthias, > > Thanks for the feedback. I was on vacation for a while. Pardon for the > late replies. Please see them inline below > > On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax wrote: > > > > Seems I am late to the party... Great KIP. Couple of questions from my side: > > > > (1) What is the purpose of `standby-updating-tasks`? It seems to be the > > same as the number of assigned standby task? Not sure how useful it > > would be? > > > In general, yes, it is the number of assigned standby tasks --- there > will be transit times when the assigned standby tasks are not yet > being updated but it would not last long --- but we do not yet have a > direct gauge to expose this before, and users have to infer this from > other indirect metrics. > > > > > > > (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused" > > exactly mean? There was a discussion about renaming the callback method > > from pause to suspended. So should this be called `suspended`, too? And > > if yes, how is it useful for users? > > > Pausing here refers to "KIP-834: Pause / Resume KafkaStreams > Topologies" > (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832). > When a topology is paused, all its tasks including standbys will be > paused too. > > I'm not aware of a discussion to rename the call name to "suspend" for > KIP-834. Could you point me to the reference? > > > > > > > (3) `restore-ratio`: the description says > > > > > The fraction of time the thread spent on restoring active or standby tasks > > > > I find the term "restoring" does only apply to active tasks, but not to > > standbys. Can we reword this? > > > Yeah I have been discussing this with others in the community a bit as > well, but so far I have not been convinced of a better name than it. > Some other alternatives being discussed but not win everyone's love is > "restore-or-update-ratio", "process-ratio" (for the restore thread > that means restoring or updating), and "io-ratio". > > The only one so far that I feel is probably better, is > "state-update-ratio". If folks feel this one is better than > "restore-ratio" I'm happy to update. > > > > > (4) `restore-call-rate`: not sure what you exactly mean by "restore calls"? > > > This is similar to the "io-calls-rate" in the selector classes, i.e. > the number of "restore" function calls made. It's argurably a very > low-level metrics but I included it since it could be useful in some > debugging scenarios. > > > > > (5) `restore-remaining-records-total` -- why is this a task metric? > > Seems we could roll it up into a thread metric that we report at INFO > > level (we could still have per-task DEBUG level metric for it in addition). > > > The rationale behind it is the general principle in metrics design > that "Kafka would provide the lowest necessary metrics levels, and > users can do the roll-ups however they want". > > > > > (6) What about "warmup tasks"? Internally, we treat them as standbys, > > but it seems it's hard for users to reason about it in the scale-out > > warm-up case. Would it be helpful (and possible) to report "warmup > > progress" explicitly? > > > At the restore thread level, we cannot differentiate standby tasks > from warmup tasks since the latter is created exactly just like the > former. But I do agree this is an issue for visibility that worth > addressing, I think another KIP would be needed to first consider > distinguishing these two at the class level. > > > > > -Matthias > > > > > > On 11/1/22 2:44 AM, Lucas Brutschy wrote: > > > We need this! > > > > > > + 1 non binding > > > > > > Cheers, > > > Lucas > > > > > > On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna wrote: > > >> > > >> Guozhang, > > >> > > >> Thanks for the KIP! > > >> > > >> +1 (binding) > > >> > > >> Best, > > >> Bruno > > >> > > >> On 25.10.22 22:07, Walker Carlson wrote: > > >>> +1 non binding > > >>> > > >>> Thanks for the kip! > > >>> > > >>> On Thu, Oct 20, 2022 at 10:25 PM John Roesler > > >>> wrote: > > >>> > > Thanks for the KIP, Guozhang! > > > > I'm +1 (binding) > > > > -John > > > >
Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility
Hey, I'm changing my vote to binding now :) On Mon, Jan 23, 2023 at 9:38 PM Matthias J. Sax wrote: > Thanks Guozhang. Couple of clarifications and follow up questions. > > > >> I'm not aware of a discussion to rename the call name to "suspend" for > >> KIP-834. Could you point me to the reference? > > My commend was not about KIP-834, but about this KIP. You originally > proposed to call the new call-back `onRestorePause` but to avoid > confusion it was improved and renamed to `onRestoreSuspended`. > > > > The only one so far that I feel is probably better, is > > "state-update-ratio". If folks feel this one is better than > > "restore-ratio" I'm happy to update. > > Could we actually report two metric, one for the restore phase > (restore-ration), and one for steady state ([standby-]update-ratio)? > > I could like with `state-update-ratio` if we want to have a single > metric for both, but splitting them sound useful to me. > > > > (4) `restore-call-rate` > > Maybe we can clarify in the description a little bit. I agree it's very > low level but if you think it could be useful to debugging, I have no > objection. > > > > The rationale behind it is the general principle in metrics design > > that "Kafka would provide the lowest necessary metrics levels, and > > users can do the roll-ups however they want". > > That's fair, but it seems to be a rather important metric, and having it > only at DEBUG level seems not ideal? Could we make it INFO level, even > if it's a task level (ie, apply an exception to the rule). > > > > -Matthias > > > > On 1/19/23 2:35 PM, Guozhang Wang wrote: > > Hello Matthias, > > > > Thanks for the feedback. I was on vacation for a while. Pardon for the > > late replies. Please see them inline below > > > > On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax > wrote: > >> > >> Seems I am late to the party... Great KIP. Couple of questions from my > side: > >> > >> (1) What is the purpose of `standby-updating-tasks`? It seems to be the > >> same as the number of assigned standby task? Not sure how useful it > >> would be? > >> > > In general, yes, it is the number of assigned standby tasks --- there > > will be transit times when the assigned standby tasks are not yet > > being updated but it would not last long --- but we do not yet have a > > direct gauge to expose this before, and users have to infer this from > > other indirect metrics. > > > >> > >> > >> (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused" > >> exactly mean? There was a discussion about renaming the callback method > >> from pause to suspended. So should this be called `suspended`, too? And > >> if yes, how is it useful for users? > >> > > Pausing here refers to "KIP-834: Pause / Resume KafkaStreams > > Topologies" ( > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832 > ). > > When a topology is paused, all its tasks including standbys will be > > paused too. > > > > I'm not aware of a discussion to rename the call name to "suspend" for > > KIP-834. Could you point me to the reference? > > > >> > >> > >> (3) `restore-ratio`: the description says > >> > >>> The fraction of time the thread spent on restoring active or standby > tasks > >> > >> I find the term "restoring" does only apply to active tasks, but not to > >> standbys. Can we reword this? > >> > > Yeah I have been discussing this with others in the community a bit as > > well, but so far I have not been convinced of a better name than it. > > Some other alternatives being discussed but not win everyone's love is > > "restore-or-update-ratio", "process-ratio" (for the restore thread > > that means restoring or updating), and "io-ratio". > > > > The only one so far that I feel is probably better, is > > "state-update-ratio". If folks feel this one is better than > > "restore-ratio" I'm happy to update. > > > >> > >> (4) `restore-call-rate`: not sure what you exactly mean by "restore > calls"? > >> > > This is similar to the "io-calls-rate" in the selector classes, i.e. > > the number of "restore" function calls made. It's argurably a very > > low-level metrics but I included it since it could be useful in some > > debugging scenarios. > > > >> > >> (5) `restore-remaining-records-total` -- why is this a task metric? > >> Seems we could roll it up into a thread metric that we report at INFO > >> level (we could still have per-task DEBUG level metric for it in > addition). > >> > > The rationale behind it is the general principle in metrics design > > that "Kafka would provide the lowest necessary metrics levels, and > > users can do the roll-ups however they want". > > > >> > >> (6) What about "warmup tasks"? Internally, we treat them as standbys, > >> but it seems it's hard for users to reason about it in the scale-out > >> warm-up case. Would it be helpful (and possible) to report "warmup > >> progress" explicitly? > >> > > At the restore thread level, we cannot differentiate standby tasks > > from warmup tasks
Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility
Thanks Guozhang. Couple of clarifications and follow up questions. I'm not aware of a discussion to rename the call name to "suspend" for KIP-834. Could you point me to the reference? My commend was not about KIP-834, but about this KIP. You originally proposed to call the new call-back `onRestorePause` but to avoid confusion it was improved and renamed to `onRestoreSuspended`. The only one so far that I feel is probably better, is "state-update-ratio". If folks feel this one is better than "restore-ratio" I'm happy to update. Could we actually report two metric, one for the restore phase (restore-ration), and one for steady state ([standby-]update-ratio)? I could like with `state-update-ratio` if we want to have a single metric for both, but splitting them sound useful to me. (4) `restore-call-rate` Maybe we can clarify in the description a little bit. I agree it's very low level but if you think it could be useful to debugging, I have no objection. The rationale behind it is the general principle in metrics design that "Kafka would provide the lowest necessary metrics levels, and users can do the roll-ups however they want". That's fair, but it seems to be a rather important metric, and having it only at DEBUG level seems not ideal? Could we make it INFO level, even if it's a task level (ie, apply an exception to the rule). -Matthias On 1/19/23 2:35 PM, Guozhang Wang wrote: Hello Matthias, Thanks for the feedback. I was on vacation for a while. Pardon for the late replies. Please see them inline below On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax wrote: Seems I am late to the party... Great KIP. Couple of questions from my side: (1) What is the purpose of `standby-updating-tasks`? It seems to be the same as the number of assigned standby task? Not sure how useful it would be? In general, yes, it is the number of assigned standby tasks --- there will be transit times when the assigned standby tasks are not yet being updated but it would not last long --- but we do not yet have a direct gauge to expose this before, and users have to infer this from other indirect metrics. (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused" exactly mean? There was a discussion about renaming the callback method from pause to suspended. So should this be called `suspended`, too? And if yes, how is it useful for users? Pausing here refers to "KIP-834: Pause / Resume KafkaStreams Topologies" (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832). When a topology is paused, all its tasks including standbys will be paused too. I'm not aware of a discussion to rename the call name to "suspend" for KIP-834. Could you point me to the reference? (3) `restore-ratio`: the description says The fraction of time the thread spent on restoring active or standby tasks I find the term "restoring" does only apply to active tasks, but not to standbys. Can we reword this? Yeah I have been discussing this with others in the community a bit as well, but so far I have not been convinced of a better name than it. Some other alternatives being discussed but not win everyone's love is "restore-or-update-ratio", "process-ratio" (for the restore thread that means restoring or updating), and "io-ratio". The only one so far that I feel is probably better, is "state-update-ratio". If folks feel this one is better than "restore-ratio" I'm happy to update. (4) `restore-call-rate`: not sure what you exactly mean by "restore calls"? This is similar to the "io-calls-rate" in the selector classes, i.e. the number of "restore" function calls made. It's argurably a very low-level metrics but I included it since it could be useful in some debugging scenarios. (5) `restore-remaining-records-total` -- why is this a task metric? Seems we could roll it up into a thread metric that we report at INFO level (we could still have per-task DEBUG level metric for it in addition). The rationale behind it is the general principle in metrics design that "Kafka would provide the lowest necessary metrics levels, and users can do the roll-ups however they want". (6) What about "warmup tasks"? Internally, we treat them as standbys, but it seems it's hard for users to reason about it in the scale-out warm-up case. Would it be helpful (and possible) to report "warmup progress" explicitly? At the restore thread level, we cannot differentiate standby tasks from warmup tasks since the latter is created exactly just like the former. But I do agree this is an issue for visibility that worth addressing, I think another KIP would be needed to first consider distinguishing these two at the class level. -Matthias On 11/1/22 2:44 AM, Lucas Brutschy wrote: We need this! + 1 non binding Cheers, Lucas On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna wrote: Guozhang, Thanks for the KIP! +1 (binding) Best, Bruno On 25.10.22 22:07, Walker Carlson wrote: +1
Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility
Hello Matthias, Thanks for the feedback. I was on vacation for a while. Pardon for the late replies. Please see them inline below On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax wrote: > > Seems I am late to the party... Great KIP. Couple of questions from my side: > > (1) What is the purpose of `standby-updating-tasks`? It seems to be the > same as the number of assigned standby task? Not sure how useful it > would be? > In general, yes, it is the number of assigned standby tasks --- there will be transit times when the assigned standby tasks are not yet being updated but it would not last long --- but we do not yet have a direct gauge to expose this before, and users have to infer this from other indirect metrics. > > > (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused" > exactly mean? There was a discussion about renaming the callback method > from pause to suspended. So should this be called `suspended`, too? And > if yes, how is it useful for users? > Pausing here refers to "KIP-834: Pause / Resume KafkaStreams Topologies" (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832). When a topology is paused, all its tasks including standbys will be paused too. I'm not aware of a discussion to rename the call name to "suspend" for KIP-834. Could you point me to the reference? > > > (3) `restore-ratio`: the description says > > > The fraction of time the thread spent on restoring active or standby tasks > > I find the term "restoring" does only apply to active tasks, but not to > standbys. Can we reword this? > Yeah I have been discussing this with others in the community a bit as well, but so far I have not been convinced of a better name than it. Some other alternatives being discussed but not win everyone's love is "restore-or-update-ratio", "process-ratio" (for the restore thread that means restoring or updating), and "io-ratio". The only one so far that I feel is probably better, is "state-update-ratio". If folks feel this one is better than "restore-ratio" I'm happy to update. > > (4) `restore-call-rate`: not sure what you exactly mean by "restore calls"? > This is similar to the "io-calls-rate" in the selector classes, i.e. the number of "restore" function calls made. It's argurably a very low-level metrics but I included it since it could be useful in some debugging scenarios. > > (5) `restore-remaining-records-total` -- why is this a task metric? > Seems we could roll it up into a thread metric that we report at INFO > level (we could still have per-task DEBUG level metric for it in addition). > The rationale behind it is the general principle in metrics design that "Kafka would provide the lowest necessary metrics levels, and users can do the roll-ups however they want". > > (6) What about "warmup tasks"? Internally, we treat them as standbys, > but it seems it's hard for users to reason about it in the scale-out > warm-up case. Would it be helpful (and possible) to report "warmup > progress" explicitly? > At the restore thread level, we cannot differentiate standby tasks from warmup tasks since the latter is created exactly just like the former. But I do agree this is an issue for visibility that worth addressing, I think another KIP would be needed to first consider distinguishing these two at the class level. > > -Matthias > > > On 11/1/22 2:44 AM, Lucas Brutschy wrote: > > We need this! > > > > + 1 non binding > > > > Cheers, > > Lucas > > > > On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna wrote: > >> > >> Guozhang, > >> > >> Thanks for the KIP! > >> > >> +1 (binding) > >> > >> Best, > >> Bruno > >> > >> On 25.10.22 22:07, Walker Carlson wrote: > >>> +1 non binding > >>> > >>> Thanks for the kip! > >>> > >>> On Thu, Oct 20, 2022 at 10:25 PM John Roesler wrote: > >>> > Thanks for the KIP, Guozhang! > > I'm +1 (binding) > > -John > > On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote: > > Can't wait! > > +1 (non-binding) > > > > On Wed, 12 Oct 2022, 18:02 Guozhang Wang, > > wrote: > > > >> Hello all, > >> > >> I'd like to start a vote for the following KIP, aiming to improve Kafka > >> Stream's restoration visibility via new metrics and callback methods: > >> > >> > >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility > >> > >> > >> Thanks! > >> -- Guozhang > >> > > >>>
Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility
Seems I am late to the party... Great KIP. Couple of questions from my side: (1) What is the purpose of `standby-updating-tasks`? It seems to be the same as the number of assigned standby task? Not sure how useful it would be? (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused" exactly mean? There was a discussion about renaming the callback method from pause to suspended. So should this be called `suspended`, too? And if yes, how is it useful for users? (3) `restore-ratio`: the description says The fraction of time the thread spent on restoring active or standby tasks I find the term "restoring" does only apply to active tasks, but not to standbys. Can we reword this? (4) `restore-call-rate`: not sure what you exactly mean by "restore calls"? (5) `restore-remaining-records-total` -- why is this a task metric? Seems we could roll it up into a thread metric that we report at INFO level (we could still have per-task DEBUG level metric for it in addition). (6) What about "warmup tasks"? Internally, we treat them as standbys, but it seems it's hard for users to reason about it in the scale-out warm-up case. Would it be helpful (and possible) to report "warmup progress" explicitly? -Matthias On 11/1/22 2:44 AM, Lucas Brutschy wrote: We need this! + 1 non binding Cheers, Lucas On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna wrote: Guozhang, Thanks for the KIP! +1 (binding) Best, Bruno On 25.10.22 22:07, Walker Carlson wrote: +1 non binding Thanks for the kip! On Thu, Oct 20, 2022 at 10:25 PM John Roesler wrote: Thanks for the KIP, Guozhang! I'm +1 (binding) -John On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote: Can't wait! +1 (non-binding) On Wed, 12 Oct 2022, 18:02 Guozhang Wang, wrote: Hello all, I'd like to start a vote for the following KIP, aiming to improve Kafka Stream's restoration visibility via new metrics and callback methods: https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility Thanks! -- Guozhang
Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility
We need this! + 1 non binding Cheers, Lucas On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna wrote: > > Guozhang, > > Thanks for the KIP! > > +1 (binding) > > Best, > Bruno > > On 25.10.22 22:07, Walker Carlson wrote: > > +1 non binding > > > > Thanks for the kip! > > > > On Thu, Oct 20, 2022 at 10:25 PM John Roesler wrote: > > > >> Thanks for the KIP, Guozhang! > >> > >> I'm +1 (binding) > >> > >> -John > >> > >> On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote: > >>> Can't wait! > >>> +1 (non-binding) > >>> > >>> On Wed, 12 Oct 2022, 18:02 Guozhang Wang, > >>> wrote: > >>> > Hello all, > > I'd like to start a vote for the following KIP, aiming to improve Kafka > Stream's restoration visibility via new metrics and callback methods: > > > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility > > > Thanks! > -- Guozhang > > >> > >
Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility
Guozhang, Thanks for the KIP! +1 (binding) Best, Bruno On 25.10.22 22:07, Walker Carlson wrote: +1 non binding Thanks for the kip! On Thu, Oct 20, 2022 at 10:25 PM John Roesler wrote: Thanks for the KIP, Guozhang! I'm +1 (binding) -John On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote: Can't wait! +1 (non-binding) On Wed, 12 Oct 2022, 18:02 Guozhang Wang, wrote: Hello all, I'd like to start a vote for the following KIP, aiming to improve Kafka Stream's restoration visibility via new metrics and callback methods: https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility Thanks! -- Guozhang
Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility
+1 non binding Thanks for the kip! On Thu, Oct 20, 2022 at 10:25 PM John Roesler wrote: > Thanks for the KIP, Guozhang! > > I'm +1 (binding) > > -John > > On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote: > > Can't wait! > > +1 (non-binding) > > > > On Wed, 12 Oct 2022, 18:02 Guozhang Wang, > > wrote: > > > >> Hello all, > >> > >> I'd like to start a vote for the following KIP, aiming to improve Kafka > >> Stream's restoration visibility via new metrics and callback methods: > >> > >> > >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility > >> > >> > >> Thanks! > >> -- Guozhang > >> >
Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility
Thanks for the KIP, Guozhang! I'm +1 (binding) -John On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote: > Can't wait! > +1 (non-binding) > > On Wed, 12 Oct 2022, 18:02 Guozhang Wang, > wrote: > >> Hello all, >> >> I'd like to start a vote for the following KIP, aiming to improve Kafka >> Stream's restoration visibility via new metrics and callback methods: >> >> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility >> >> >> Thanks! >> -- Guozhang >>
Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility
Can't wait! +1 (non-binding) On Wed, 12 Oct 2022, 18:02 Guozhang Wang, wrote: > Hello all, > > I'd like to start a vote for the following KIP, aiming to improve Kafka > Stream's restoration visibility via new metrics and callback methods: > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility > > > Thanks! > -- Guozhang >