[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal
[ https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071141#comment-16071141 ] ASF GitHub Bot commented on FLINK-6742: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/4185 > Improve error message when savepoint migration fails due to task removal > > > Key: FLINK-6742 > URL: https://issues.apache.org/jira/browse/FLINK-6742 > Project: Flink > Issue Type: Bug > Components: State Backends, Checkpointing >Affects Versions: 1.3.0 >Reporter: Gyula Fora >Assignee: Chesnay Schepler >Priority: Minor > Labels: flink-rel-1.3.1-blockers > > Caused by: java.lang.NullPointerException > at > org.apache.flink.runtime.checkpoint.savepoint.SavepointV2.convertToOperatorStateSavepointV2(SavepointV2.java:171) > at > org.apache.flink.runtime.checkpoint.savepoint.SavepointLoader.loadAndValidateSavepoint(SavepointLoader.java:75) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1090) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal
[ https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16063193#comment-16063193 ] ASF GitHub Bot commented on FLINK-6742: --- Github user zentol commented on the issue: https://github.com/apache/flink/pull/4185 Thank you for taking a look, merging this.
[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal
[ https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062973#comment-16062973 ]

ASF GitHub Bot commented on FLINK-6742:
---

GitHub user zentol opened a pull request:

    https://github.com/apache/flink/pull/4185

    [FLINK-6742] Add eager checks for parallelism/chain-length change

    This is a follow-up to #4083 that adds checks to the savepoint migration for any change in parallelism or chain length.

    Should be merged for 1.3 and 1.4.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zentol/flink 6742_2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4185.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #4185

commit 7f95bc69e4aced1ca1d89c1cbc2067150f5f583b
Author: zentol
Date:   2017-06-26T11:38:54Z

    [FLINK-6742] Add eager checks for parallelism/chain-length change
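The chain-length half of such an eager check can be illustrated with a small, self-contained sketch. The class `ChainLengthCheck`, its method, the error text and the operator names are hypothetical stand-ins, not Flink's classes and not the code from the pull request. The idea is only that the chain length recorded in the savepoint is compared with the number of operators in the current chain before any state is converted.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration; not the code merged in the pull request. */
class ChainLengthCheck {

    /**
     * Compares the chain length recorded in the savepoint with the number of
     * operators in the current chain and fails eagerly if they differ.
     */
    static void checkChainLength(String taskId, int savedChainLength, List<String> currentOperators) {
        if (savedChainLength != currentOperators.size()) {
            throw new IllegalStateException(
                "Chain length of task " + taskId + " changed from " + savedChainLength
                    + " to " + currentOperators.size() + " during savepoint migration.");
        }
    }

    public static void main(String[] args) {
        List<String> chain = Arrays.asList("Source", "Map", "Sink");
        checkChainLength("task-1", 3, chain); // same length: passes silently
        try {
            checkChainLength("task-1", 2, chain); // savepoint recorded a shorter chain
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing before the conversion loop means the user sees which task changed, rather than an exception raised somewhere inside the state conversion.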
[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal
[ https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062868#comment-16062868 ]

ASF GitHub Bot commented on FLINK-6742:
---

Github user zentol commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4083#discussion_r123972388

    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/savepoint/SavepointV2.java ---
    @@ -168,10 +168,27 @@ public static Savepoint convertToOperatorStateSavepointV2(
                     expandedToLegacyIds = true;
                 }

    +            if (jobVertex == null) {
    +                throw new IllegalStateException(
    +                    "Could not find task for state with ID " + taskState.getJobVertexID() + ". " +
    +                    "When migrating a savepoint from a version < 1.3 please make sure that the topology was not " +
    +                    "changed through removal of a stateful operator or modification of a chain containing a stateful " +
    +                    "operator.");
    +            }
    +
                 List operatorIDs = jobVertex.getOperatorIDs();

                 for (int subtaskIndex = 0; subtaskIndex < jobVertex.getParallelism(); subtaskIndex++) {
    -                SubtaskState subtaskState = taskState.getState(subtaskIndex);
    +                SubtaskState subtaskState;
    +                try {
    +                    subtaskState = taskState.getState(subtaskIndex);
    --- End diff --

    yes that's true, I'll create a follow-up PR.
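The null check quoted in the diff above follows a general fail-fast pattern: look the task up, and if it is missing, raise an exception that names the ID and the likely cause instead of letting a later dereference produce a bare NullPointerException. A minimal sketch of that pattern, assuming a plain `Map` lookup in place of Flink's job-vertex lookup (the class `MissingTaskCheck`, the map, and the IDs are hypothetical; only the message text is taken from the diff):

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the migration lookup; not Flink's actual classes. */
class MissingTaskCheck {

    /** Returns the task for a state ID, or fails with a descriptive message instead of a later NPE. */
    static String findTask(Map<String, String> tasksById, String stateId) {
        String task = tasksById.get(stateId);
        if (task == null) {
            throw new IllegalStateException(
                "Could not find task for state with ID " + stateId + ". "
                    + "When migrating a savepoint from a version < 1.3 please make sure that the topology "
                    + "was not changed through removal of a stateful operator or modification of a chain "
                    + "containing a stateful operator.");
        }
        return task;
    }

    public static void main(String[] args) {
        Map<String, String> tasks = new HashMap<>();
        tasks.put("vertex-a", "Source -> Map");

        System.out.println(findTask(tasks, "vertex-a"));
        try {
            findTask(tasks, "vertex-removed"); // state refers to a task that no longer exists
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```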
[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal
[ https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062364#comment-16062364 ] Chesnay Schepler commented on FLINK-6742: - I merged it to 1.3 after your comment, so it was ok ;) I'll think about your suggestion regarding the parallelism tomorrow.
[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal
[ https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062308#comment-16062308 ] Gyula Fora commented on FLINK-6742: --- Ah sorry Chesnay I missed it on the 1.3 branch :/
[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal
[ https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062297#comment-16062297 ] Chesnay Schepler commented on FLINK-6742: - 1.3: 2bbfe0292c13d875b531a6c168ea78bfc7f21f0b 1.4: 72b0ae069f8404a2f8a952e1a20004b9d340c445
[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal
[ https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062282#comment-16062282 ]

ASF GitHub Bot commented on FLINK-6742:
---

Github user gyfora commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4083#discussion_r123895607

    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/savepoint/SavepointV2.java ---
    @@ -168,10 +168,27 @@ public static Savepoint convertToOperatorStateSavepointV2(
                     expandedToLegacyIds = true;
                 }

    +            if (jobVertex == null) {
    +                throw new IllegalStateException(
    +                    "Could not find task for state with ID " + taskState.getJobVertexID() + ". " +
    +                    "When migrating a savepoint from a version < 1.3 please make sure that the topology was not " +
    +                    "changed through removal of a stateful operator or modification of a chain containing a stateful " +
    +                    "operator.");
    +            }
    +
                 List operatorIDs = jobVertex.getOperatorIDs();

                 for (int subtaskIndex = 0; subtaskIndex < jobVertex.getParallelism(); subtaskIndex++) {
    -                SubtaskState subtaskState = taskState.getState(subtaskIndex);
    +                SubtaskState subtaskState;
    +                try {
    +                    subtaskState = taskState.getState(subtaskIndex);
    --- End diff --

    Sorry for commenting late on this but I have had some major migration issues in the last few days :D

    I think we should explicitly compare parallelism instead of relying on the error:

    if (taskState.getStates().size() != jobVertex.getParallelism()) --> error

    Otherwise this will not fail on lower parallelism.
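The point about lower parallelism can be made concrete with a small sketch. It assumes, as the comment implies, that the per-subtask lookup fails only when the requested index is out of range. A loop that runs up to the new parallelism then never requests a missing index when the parallelism was reduced, so the mismatch goes unnoticed, while an explicit size comparison catches both directions. The class `ParallelismCheck` and the list of strings standing in for subtask states are hypothetical, not Flink's types.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of the two strategies; not Flink's actual classes. */
class ParallelismCheck {

    /** Implicit check: detects a mismatch only if an out-of-range subtask index is requested. */
    static boolean failsByIndexError(List<String> savedSubtaskStates, int newParallelism) {
        try {
            for (int subtaskIndex = 0; subtaskIndex < newParallelism; subtaskIndex++) {
                savedSubtaskStates.get(subtaskIndex);
            }
            return false; // every requested index existed, so nothing was detected
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    /** Explicit check: compares the sizes up front, so higher and lower parallelism both fail. */
    static void checkParallelism(List<String> savedSubtaskStates, int newParallelism) {
        if (savedSubtaskStates.size() != newParallelism) {
            throw new IllegalStateException(
                "Parallelism changed from " + savedSubtaskStates.size() + " to " + newParallelism
                    + " during savepoint migration.");
        }
    }

    public static void main(String[] args) {
        List<String> saved = Arrays.asList("s0", "s1", "s2", "s3"); // savepoint taken with parallelism 4

        System.out.println(failsByIndexError(saved, 6)); // true: indices 4 and 5 are missing
        System.out.println(failsByIndexError(saved, 2)); // false: the reduction goes unnoticed
        try {
            checkParallelism(saved, 2);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```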
[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal
[ https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062229#comment-16062229 ] ASF GitHub Bot commented on FLINK-6742: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/4083
[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal
[ https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16042403#comment-16042403 ] ASF GitHub Bot commented on FLINK-6742: --- Github user rmetzger commented on the issue: https://github.com/apache/flink/pull/4083 Change looks good to merge!
[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal
[ https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16040645#comment-16040645 ]

ASF GitHub Bot commented on FLINK-6742:
---

GitHub user zentol opened a pull request:

    https://github.com/apache/flink/pull/4083

    [FLINK-6742] Improve savepoint migration failure error message

    This PR improves the error messages if the savepoint migration fails because a stateful task was removed or the parallelism of a stateful operator was changed.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zentol/flink 6742

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4083.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #4083

commit 6f701d17e7eb62a21f5c9466ba9acf8696ec9ab8
Author: zentol
Date:   2017-06-07T10:03:21Z

    [FLINK-6742] Improve savepoint migration failure error message

commit 38b07a7c4654a84b4370ed948be3ab76c28afad5
Author: zentol
Date:   2017-06-07T10:03:57Z

    [hotfix] Improve readability in SPV2#convertToOperatorStateSavepointV2