[jira] [Commented] (FLINK-9480) Let local recovery support rescaling
[ https://issues.apache.org/jira/browse/FLINK-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17336625#comment-17336625 ]

Flink Jira Bot commented on FLINK-9480:
---

This issue was labeled "stale-major" 7 days ago and has not received any updates, so it is being deprioritized. If this ticket is actually Major, please raise the priority and ask a committer to assign you the issue or revive the public discussion.

> Let local recovery support rescaling
>
> Key: FLINK-9480
> URL: https://issues.apache.org/jira/browse/FLINK-9480
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / State Backends
> Affects Versions: 1.5.0
> Reporter: Sihua Zhou
> Priority: Major
> Labels: stale-major
>
> Currently, local recovery only supports restoring from a checkpoint, without
> rescaling. Maybe we should enable it to support rescaling.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-9480) Let local recovery support rescaling
[ https://issues.apache.org/jira/browse/FLINK-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17328624#comment-17328624 ]

Flink Jira Bot commented on FLINK-9480:
---

This major issue is unassigned, and neither it nor any of its Sub-Tasks has been updated for 30 days, so it has been labeled "stale-major". If this ticket is indeed "major", please either assign yourself or give an update. Afterwards, please remove the label. In 7 days the issue will be deprioritized.
[jira] [Commented] (FLINK-9480) Let local recovery support rescaling
[ https://issues.apache.org/jira/browse/FLINK-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495140#comment-16495140 ]

Till Rohrmann commented on FLINK-9480:
--

I can totally see the benefits of speeding up the rescaling operation by first trying to read local state and then falling back to remote state. In the first iteration, it could be a best-effort approach as suggested by [~sihuazhou]. Next, we could try to make the scheduling a bit smarter, and eventually it could mean that we first load the required state onto a TM before deploying tasks.

I also agree with you two about the priorities with respect to rescalable timers and state TTL.

> Let local recovery support rescaling
>
> Key: FLINK-9480
> URL: https://issues.apache.org/jira/browse/FLINK-9480
> Project: Flink
> Issue Type: Improvement
> Components: State Backends, Checkpointing
> Affects Versions: 1.5.0
> Reporter: Sihua Zhou
> Priority: Major
>
> Currently, local recovery only supports restoring from a checkpoint, without
> rescaling. Maybe we should enable it to support rescaling.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
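The best-effort idea discussed in this thread (read local state first, fall back to remote state) could be sketched roughly as below. This is only an illustration under stated assumptions, not Flink code: the class BestEffortRestore, the method planRestore, and the maps standing in for local and remote state are all hypothetical names. It assumes state is redistributed per key group after rescaling, and that a restarted subtask may find only some of its newly assigned key groups on local disk.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names are Flink classes or APIs.
class BestEffortRestore {

    /** Where the state of one key group will be read from. */
    enum Source { LOCAL, REMOTE }

    /**
     * For every key group the rescaled subtask now owns, prefer the local
     * copy and fall back to the remote (checkpoint/savepoint) copy.
     */
    static Map<Integer, Source> planRestore(List<Integer> neededKeyGroups,
                                            Map<Integer, byte[]> localState,
                                            Map<Integer, byte[]> remoteState) {
        Map<Integer, Source> plan = new LinkedHashMap<>();
        for (int keyGroup : neededKeyGroups) {
            if (localState.containsKey(keyGroup)) {
                // Fast path: state is already on this machine.
                plan.put(keyGroup, Source.LOCAL);
            } else if (remoteState.containsKey(keyGroup)) {
                // Best-effort fallback: load from remote storage.
                plan.put(keyGroup, Source.REMOTE);
            } else {
                throw new IllegalStateException(
                        "No state found for key group " + keyGroup);
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        // Assumed scenario: the subtask held key groups 0-1 before rescaling
        // and owns key groups 0-2 afterwards.
        Map<Integer, byte[]> local = Map.of(0, new byte[] {1}, 1, new byte[] {2});
        Map<Integer, byte[]> remote = Map.of(
                0, new byte[] {1}, 1, new byte[] {2},
                2, new byte[] {3}, 3, new byte[] {4});
        System.out.println(planRestore(List.of(0, 1, 2), local, remote));
    }
}
```

The sketch deliberately ignores the harder questions raised elsewhere in the thread, such as persisting the previous scheduling and deciding when local state can be released.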
[jira] [Commented] (FLINK-9480) Let local recovery support rescaling
[ https://issues.apache.org/jira/browse/FLINK-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494908#comment-16494908 ]

Sihua Zhou commented on FLINK-9480:
---

[~srichter] Thanks for your reply. The use case I want to improve is the online rescaling feature of 1.5. Currently, it works as follows:
- trigger a savepoint
- rescale from the savepoint

For online rescaling to take advantage of local recovery, local recovery needs to support rescaling. It need not be so strict that every node can only restore locally; it can be best effort: if a node can't find its local state, it can still load the data from remote.

Yes, I agree that this feature's priority is lower than "timer service" and "ttl state"; I just created it in case we want to do it in the future...
[jira] [Commented] (FLINK-9480) Let local recovery support rescaling
[ https://issues.apache.org/jira/browse/FLINK-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494897#comment-16494897 ]

Stefan Richter commented on FLINK-9480:
---

Maybe [~StephanEwen] or [~till.rohrmann] can also give their view on the priority of this topic?
[jira] [Commented] (FLINK-9480) Let local recovery support rescaling
[ https://issues.apache.org/jira/browse/FLINK-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494892#comment-16494892 ]

Stefan Richter commented on FLINK-9480:
---

Can you give some more details on why you think this is useful or important, or which use case you want to improve? The main goal of local recovery was faster recovery under failures, and recovery does not give you any opportunity to rescale, so we are talking about restarts from local state, which already makes things a bit tricky.

For local recovery, you need to know about your previous scheduling. That information might get lost when the job is stopped and the JM goes away, so we would need to persist it, e.g. in ZooKeeper. Even then, you can still run into the problem that the previous locations are already occupied by another job in the meantime. And when can you finally let go of the local state with this approach? Or are we talking about some form of rescaling that does not terminate the previous job / JM? I want to point out that this could complicate things quite a bit.

In this context, we can also think about replicating state to pre-warm nodes, or having more alternatives with local state in case a node goes down. But that is also a new feature by itself.

Bottom line: personally, I currently still see many features (timer service, ttl state, ...) that I would consider to have a higher priority, but eventually we can surely think about improved rescaling and/or replication.
[jira] [Commented] (FLINK-9480) Let local recovery support rescaling
[ https://issues.apache.org/jira/browse/FLINK-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494880#comment-16494880 ]

Sihua Zhou commented on FLINK-9480:
---

[~stefanrichte...@gmail.com] What do you think of this?

> Let local recovery support rescaling
>
> Key: FLINK-9480
> URL: https://issues.apache.org/jira/browse/FLINK-9480
> Project: Flink
> Issue Type: Improvement
> Reporter: Sihua Zhou
> Priority: Major
>
> Currently, local recovery only supports restoring from a checkpoint, without
> rescaling. Maybe we should enable it to support rescaling.