[jira] [Commented] (FLINK-9480) Let local recovery support rescaling

2021-04-29 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17336625#comment-17336625
 ] 

Flink Jira Bot commented on FLINK-9480:
---

This issue was labeled "stale-major" 7 days ago and has not received any 
updates, so it is being deprioritized. If this ticket is actually Major, please 
raise the priority and ask a committer to assign you the issue or revive the 
public discussion.


> Let local recovery support rescaling
> 
>
> Key: FLINK-9480
> URL: https://issues.apache.org/jira/browse/FLINK-9480
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / State Backends
> Affects Versions: 1.5.0
> Reporter: Sihua Zhou
> Priority: Major
> Labels: stale-major
>
> Currently, local recovery only supports restoring from a checkpoint, without 
> rescaling. Maybe we should enable it to support rescaling.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-9480) Let local recovery support rescaling

2021-04-22 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17328624#comment-17328624
 ] 

Flink Jira Bot commented on FLINK-9480:
---

This major issue is unassigned, and neither it nor any of its Sub-Tasks has 
been updated for 30 days, so it has been labeled "stale-major". If this ticket 
is indeed "major", please either assign yourself or give an update. Afterwards, 
please remove the label. In 7 days the issue will be deprioritized.



[jira] [Commented] (FLINK-9480) Let local recovery support rescaling

2018-05-30 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495140#comment-16495140
 ] 

Till Rohrmann commented on FLINK-9480:
--

I can totally see the benefit of speeding up rescaling operations by first 
trying to read local state and then falling back to remote state. In the first 
iteration, it could be a best-effort approach as suggested by [~sihuazhou]. 
Next, we could try to make the scheduling a bit smarter, and eventually that 
could mean first loading the required state onto a TM before deploying tasks.
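
To make the "smarter scheduling" idea a bit more concrete, here is a small Python sketch of one possible heuristic: place each rescaled subtask on the TM that already holds most of the key groups it needs. This is only an illustration, not Flink's scheduler. The function `assign_by_locality`, the TM ids, and the slot counts are all hypothetical names and values made up for the example.

```python
from typing import Dict, Set


def assign_by_locality(
    needed: Dict[int, Set[int]],   # new subtask index -> key groups it needs
    local: Dict[str, Set[int]],    # TM id -> key groups it holds locally
    free_slots: Dict[str, int],    # TM id -> number of free slots
) -> Dict[int, str]:
    """Greedily place each subtask on the free TM holding most of its key groups."""
    slots = dict(free_slots)  # copy so the caller's map is not mutated
    placement: Dict[int, str] = {}
    for subtask in sorted(needed):
        candidates = [tm for tm in sorted(slots) if slots[tm] > 0]
        if not candidates:
            raise ValueError("not enough free slots")
        # max() keeps the first candidate on ties, so ties resolve by TM id.
        best = max(
            candidates,
            key=lambda tm: len(needed[subtask] & local.get(tm, set())),
        )
        placement[subtask] = best
        slots[best] -= 1
    return placement


# Hypothetical cluster: two TMs, each holding one half of 8 key groups
# from before the rescale; the job is rescaled to 4 subtasks.
local_state = {"tm-a": {0, 1, 2, 3}, "tm-b": {4, 5, 6, 7}}
needs = {0: {0, 1}, 1: {2, 3}, 2: {4, 5}, 3: {6, 7}}
placement = assign_by_locality(needs, local_state, {"tm-a": 2, "tm-b": 2})
```

A greedy pass like this is not optimal in general; it is only meant to show that locality can be treated as a preference while any free slot remains acceptable.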

I also agree with you two about the priorities wrt rescalable timers and state 
ttl.

> Let local recovery support rescaling
> 
>
> Key: FLINK-9480
> URL: https://issues.apache.org/jira/browse/FLINK-9480
> Project: Flink
> Issue Type: Improvement
> Components: State Backends, Checkpointing
> Affects Versions: 1.5.0
> Reporter: Sihua Zhou
> Priority: Major
>
> Currently, local recovery only supports restoring from a checkpoint, without 
> rescaling. Maybe we should enable it to support rescaling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9480) Let local recovery support rescaling

2018-05-30 Thread Sihua Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494908#comment-16494908
 ] 

Sihua Zhou commented on FLINK-9480:
---

[~srichter] Thanks for your reply. The reason I want to improve this, and the 
use case behind it, is the online rescaling feature of 1.5. Currently, it works 
as follows:

- trigger a savepoint
- rescale from the savepoint

In order to let online rescaling take advantage of local recovery, we need 
local recovery to support rescaling. It need not be so strict that every node 
can only restore locally; it can be best effort: if a node can't find the 
local state, it can still load the data from remote.
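
As a rough illustration of the best-effort idea, the following Python sketch plans the restore for one rescaled subtask: each key group it needs is read from local state if the node happens to hold it, and from remote storage otherwise. This is not Flink code. `key_group_range` is assumed to mirror the key-group range formula in Flink's `KeyGroupRangeAssignment`, and `plan_restore` and the example numbers are hypothetical, introduced only for this sketch.

```python
from typing import Dict, Set


def key_group_range(max_parallelism: int, parallelism: int, subtask_index: int) -> range:
    """Key groups owned by one subtask (assumed to mirror Flink's range formula)."""
    start = (subtask_index * max_parallelism + parallelism - 1) // parallelism
    end = ((subtask_index + 1) * max_parallelism - 1) // parallelism
    return range(start, end + 1)


def plan_restore(needed: range, local_key_groups: Set[int]) -> Dict[int, str]:
    """Best effort: use local state per key group when present, else remote."""
    return {kg: ("local" if kg in local_key_groups else "remote") for kg in needed}


# Hypothetical scenario: max parallelism 8, job rescaled from 2 to 4 subtasks.
# This node previously ran old subtask 0, so it holds key groups 0..3 locally.
local = set(key_group_range(8, 2, 0))

# New subtask 1 lands on the same node and needs key groups 2..3: all local.
plan_same_node = plan_restore(key_group_range(8, 4, 1), local)

# New subtask 2 on this node would need key groups 4..5: all remote.
plan_other = plan_restore(key_group_range(8, 4, 2), local)
```

The point of planning per key group is that a node holding only part of the required state can still avoid part of the remote download.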

Yes, I agree that this feature's priority is lower than "timer service" and 
"ttl state". I just created it in case we want to do it in the 
future...



[jira] [Commented] (FLINK-9480) Let local recovery support rescaling

2018-05-30 Thread Stefan Richter (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494897#comment-16494897
 ] 

Stefan Richter commented on FLINK-9480:
---

Maybe [~StephanEwen] or [~till.rohrmann] can also give their view on the 
priority of this topic?



[jira] [Commented] (FLINK-9480) Let local recovery support rescaling

2018-05-30 Thread Stefan Richter (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494892#comment-16494892
 ] 

Stefan Richter commented on FLINK-9480:
---

Can you give some more details on why you think this is useful or important, or 
which use case you want to improve? The main goal of local recovery was to have 
faster recovery under failures, and recovery does not give you any 
opportunity to rescale, so we are talking about restarts from local state, and 
that already makes things a bit tricky. For local recovery, you need to know 
about your previous scheduling. The information about your previous scheduling 
might get lost when the job is stopped and the JM goes away, so we would need 
to persist that, e.g. in ZooKeeper. Even then you can still run into the 
problem that the previous locations have been occupied by another job in the 
meantime. And when can you finally let go of the local state with this 
approach? Or are we talking about some form of rescaling that does not 
terminate the previous job / JM?
I want to point out that this could complicate things quite a bit. In this 
context, we can also think about replicating state to pre-warm nodes or having 
more alternatives with local state in case a node goes down. But that is also a 
new feature by itself.
The bottom line is that, personally, I currently still see many features (timer 
service, ttl state, ...) that I would consider higher priority, but eventually 
we can surely think about improved rescaling and/or replication.



[jira] [Commented] (FLINK-9480) Let local recovery support rescaling

2018-05-30 Thread Sihua Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494880#comment-16494880
 ] 

Sihua Zhou commented on FLINK-9480:
---

[~stefanrichte...@gmail.com] What do you think of this?

> Let local recovery support rescaling
> 
>
> Key: FLINK-9480
> URL: https://issues.apache.org/jira/browse/FLINK-9480
> Project: Flink
> Issue Type: Improvement
> Reporter: Sihua Zhou
> Priority: Major
>
> Currently, local recovery only supports restoring from a checkpoint, without 
> rescaling. Maybe we should enable it to support rescaling.


