[jira] [Commented] (FLINK-2929) Recovery of jobs on cluster restarts
[ https://issues.apache.org/jira/browse/FLINK-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987374#comment-14987374 ]

Till Rohrmann commented on FLINK-2929:
--------------------------------------

We could automatically generate a random ZNode path for each cluster start. In case of a clean shutdown, this path could be removed unless it is explicitly set to be kept. When starting a new cluster, we could then add an option to start with a specific ZNode path, either to recover from it or in case of an upgrade.

However, this has the disadvantage that the user would be responsible for cleaning up the state data when it's no longer needed.

> Recovery of jobs on cluster restarts
> ------------------------------------
>
>                 Key: FLINK-2929
>                 URL: https://issues.apache.org/jira/browse/FLINK-2929
>             Project: Flink
>          Issue Type: Improvement
>    Affects Versions: 0.10
>            Reporter: Ufuk Celebi
>
> Recovery information is stored in ZooKeeper under a static root like {{/flink}}. In case of a cluster restart without canceling running jobs, old jobs will be recovered from ZooKeeper.
> This can be confusing or helpful depending on the use case.
> I suspect that the confusing case will be more common.
> We can change the default cluster start up (e.g. new YARN session or new ./start-cluster call) to purge all existing data in ZooKeeper and add a flag to not do this if needed.
> [~trohrm...@apache.org], [~aljoscha], [~StephanEwen] what's your opinion?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
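The random-path proposal in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Flink's implementation: the class `RootPathChooser`, both method names, and the choice to place generated paths under `/flink` are hypothetical, and no real ZooKeeper client is involved.

```java
import java.util.Optional;
import java.util.UUID;

/** Hypothetical helper illustrating the random-root-path idea; not Flink code. */
public class RootPathChooser {

    /** Assumed base under which generated per-start paths are placed. */
    static final String BASE = "/flink";

    /**
     * Returns the ZNode root for this cluster start: the explicitly requested
     * path if one was given (recovery or upgrade), otherwise a fresh random path.
     */
    public static String chooseRootPath(Optional<String> explicitPath) {
        return explicitPath.orElseGet(() -> BASE + "/" + UUID.randomUUID());
    }

    /**
     * On a clean shutdown the path is removed unless the user asked to keep it.
     * In this sketch nothing is removed after an unclean shutdown.
     */
    public static boolean shouldDeleteOnShutdown(boolean cleanShutdown, boolean keepRequested) {
        return cleanShutdown && !keepRequested;
    }

    public static void main(String[] args) {
        // A start without an explicit path gets a new random root each time.
        System.out.println(chooseRootPath(Optional.empty()));
        // A start with an explicit path reuses it, e.g. to recover old jobs.
        System.out.println(chooseRootPath(Optional.of("/flink/cluster-a")));
    }
}
```

Because each start without an explicit path gets its own root, two clusters no longer share recovery data by accident. The cost noted in the comment remains: paths that were kept, or left behind by an unclean shutdown, have to be cleaned up by the user.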
[jira] [Commented] (FLINK-2929) Recovery of jobs on cluster restarts
[ https://issues.apache.org/jira/browse/FLINK-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987120#comment-14987120 ]

Ufuk Celebi commented on FLINK-2929:
------------------------------------

Another related issue: running multiple Flink HA clusters with the same root ZNode path is also problematic. You can work around this by configuring the root path, but this might be confusing out-of-the-box behaviour.
[jira] [Commented] (FLINK-2929) Recovery of jobs on cluster restarts
[ https://issues.apache.org/jira/browse/FLINK-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985022#comment-14985022 ]

Till Rohrmann commented on FLINK-2929:
--------------------------------------

Sure, but for me it would make more sense to add the option for the case which is less likely, and that's probably the upgrade case.
[jira] [Commented] (FLINK-2929) Recovery of jobs on cluster restarts
[ https://issues.apache.org/jira/browse/FLINK-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14986310#comment-14986310 ]

Stephan Ewen commented on FLINK-2929:
-------------------------------------

That sounds good, but should we have the option before we make the "purge zookeeper" mode the default, so that there is always a way to do upgrades?
[jira] [Commented] (FLINK-2929) Recovery of jobs on cluster restarts
[ https://issues.apache.org/jira/browse/FLINK-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976712#comment-14976712 ]

Aljoscha Krettek commented on FLINK-2929:
-----------------------------------------

I think we have to fix it, yes. I'm not sure which should be the default behavior though. I gravitate towards making recovery of old jobs the default. But I see how it could be confusing...
[jira] [Commented] (FLINK-2929) Recovery of jobs on cluster restarts
[ https://issues.apache.org/jira/browse/FLINK-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976835#comment-14976835 ]

Ufuk Celebi commented on FLINK-2929:
------------------------------------

Maybe you are right. Keeping it as it is (and adding an option to purge) makes sure that jobs are not removed accidentally. And it's possible to cancel an old job after a restart.
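The "recover by default, purge on request" behaviour described in the comment above could look roughly like this. It is a sketch under assumptions, not Flink code: the class `StartupRecovery`, the method `jobsToRecover`, and the `purgeOnStart` flag are hypothetical names, and a plain in-memory map stands in for the recovery data stored under the ZooKeeper root.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of "recover by default, purge on request"; not Flink code. */
public class StartupRecovery {

    /**
     * Returns the IDs of the jobs to recover at cluster start.
     * The map stands in for the recovery data kept under the ZooKeeper root;
     * purgeOnStart is an assumed opt-in flag that discards that data first.
     */
    public static List<String> jobsToRecover(Map<String, String> store, boolean purgeOnStart) {
        if (purgeOnStart) {
            store.clear(); // the user explicitly asked to drop all old recovery data
        }
        return new ArrayList<>(store.keySet());
    }

    public static void main(String[] args) {
        Map<String, String> store = new TreeMap<>();
        store.put("job-1", "state handle 1");
        store.put("job-2", "state handle 2");

        // Default start: nothing is removed, so both old jobs are recovered.
        System.out.println(jobsToRecover(store, false)); // prints [job-1, job-2]

        // Start with the purge option: the old data is discarded first.
        System.out.println(jobsToRecover(store, true)); // prints []
    }
}
```

With the default start, an unwanted job can still be cancelled after the restart, whereas a purge cannot be undone. That asymmetry is the argument made above for keeping recovery as the default.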