[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal

2017-07-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071141#comment-16071141
 ] 

ASF GitHub Bot commented on FLINK-6742:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/4185


> Improve error message when savepoint migration fails due to task removal
>
>                 Key: FLINK-6742
>                 URL: https://issues.apache.org/jira/browse/FLINK-6742
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.3.0
>            Reporter: Gyula Fora
>            Assignee: Chesnay Schepler
>            Priority: Minor
>              Labels: flink-rel-1.3.1-blockers
>
> Caused by: java.lang.NullPointerException
>   at org.apache.flink.runtime.checkpoint.savepoint.SavepointV2.convertToOperatorStateSavepointV2(SavepointV2.java:171)
>   at org.apache.flink.runtime.checkpoint.savepoint.SavepointLoader.loadAndValidateSavepoint(SavepointLoader.java:75)
>   at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1090)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal

2017-06-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16063193#comment-16063193
 ] 

ASF GitHub Bot commented on FLINK-6742:
---

Github user zentol commented on the issue:

https://github.com/apache/flink/pull/4185
  
Thank you for taking a look; merging this.




[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal

2017-06-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062973#comment-16062973
 ] 

ASF GitHub Bot commented on FLINK-6742:
---

GitHub user zentol opened a pull request:

https://github.com/apache/flink/pull/4185

[FLINK-6742] Add eager checks for parallelism/chain-length change

This is a follow-up to #4083 that adds checks to the savepoint migration 
for any change in parallelism or chain length.

Should be merged for 1.3 and 1.4.
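
As an illustration of what eager checks for a parallelism or chain-length change could look like, here is a minimal, self-contained sketch. It is not the code from this pull request: `SavedTask`, `NewVertex`, and `validate` are hypothetical stand-ins rather than Flink classes, and the error messages are invented for the example. The idea is simply to compare both values up front and fail with a descriptive message before any state is converted.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: none of these types are Flink classes. */
class EagerMigrationChecks {

    /** Hypothetical summary of one task's state as stored in the old savepoint. */
    static final class SavedTask {
        final String vertexId;
        final int parallelism;
        final int chainLength;

        SavedTask(String vertexId, int parallelism, int chainLength) {
            this.vertexId = vertexId;
            this.parallelism = parallelism;
            this.chainLength = chainLength;
        }
    }

    /** Hypothetical summary of the matching vertex in the job being restored. */
    static final class NewVertex {
        final int parallelism;
        final List<String> operatorIds;

        NewVertex(int parallelism, List<String> operatorIds) {
            this.parallelism = parallelism;
            this.operatorIds = operatorIds;
        }
    }

    /** Fails fast, before any state is converted, if the shapes do not match. */
    static void validate(SavedTask saved, NewVertex vertex) {
        if (saved.parallelism != vertex.parallelism) {
            throw new IllegalStateException(
                "Parallelism of task " + saved.vertexId + " changed from "
                    + saved.parallelism + " to " + vertex.parallelism + ".");
        }
        if (saved.chainLength != vertex.operatorIds.size()) {
            throw new IllegalStateException(
                "Chain length of task " + saved.vertexId + " changed from "
                    + saved.chainLength + " to " + vertex.operatorIds.size() + ".");
        }
    }

    public static void main(String[] args) {
        SavedTask saved = new SavedTask("vertex-1", 4, 2);
        // Same parallelism and chain length: validation passes silently.
        validate(saved, new NewVertex(4, Arrays.asList("map", "filter")));
        try {
            // Parallelism lowered from 4 to 2: rejected with a clear message.
            validate(saved, new NewVertex(2, Arrays.asList("map", "filter")));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Checking both values in one place keeps the failure independent of how the later conversion loop happens to access the state.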

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zentol/flink 6742_2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/4185.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4185


commit 7f95bc69e4aced1ca1d89c1cbc2067150f5f583b
Author: zentol 
Date:   2017-06-26T11:38:54Z

[FLINK-6742] Add eager checks for parallelism/chain-length change






[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal

2017-06-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062868#comment-16062868
 ] 

ASF GitHub Bot commented on FLINK-6742:
---

Github user zentol commented on a diff in the pull request:

https://github.com/apache/flink/pull/4083#discussion_r123972388
  
--- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/savepoint/SavepointV2.java ---
@@ -168,10 +168,27 @@ public static Savepoint convertToOperatorStateSavepointV2(
                 expandedToLegacyIds = true;
             }
 
+            if (jobVertex == null) {
+                throw new IllegalStateException(
+                    "Could not find task for state with ID " + taskState.getJobVertexID() + ". " +
+                    "When migrating a savepoint from a version < 1.3 please make sure that the topology was not " +
+                    "changed through removal of a stateful operator or modification of a chain containing a stateful " +
+                    "operator.");
+            }
+
             List operatorIDs = jobVertex.getOperatorIDs();
 
             for (int subtaskIndex = 0; subtaskIndex < jobVertex.getParallelism(); subtaskIndex++) {
-                SubtaskState subtaskState = taskState.getState(subtaskIndex);
+                SubtaskState subtaskState;
+                try {
+                    subtaskState = taskState.getState(subtaskIndex);
--- End diff --

Yes, that's true; I'll create a follow-up PR.




[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal

2017-06-25 Thread Chesnay Schepler (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062364#comment-16062364
 ] 

Chesnay Schepler commented on FLINK-6742:
-

I merged it to 1.3 after your comment, so it was ok ;)

I'll think about your suggestion regarding the parallelism tomorrow.



[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal

2017-06-25 Thread Gyula Fora (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062308#comment-16062308
 ] 

Gyula Fora commented on FLINK-6742:
---

Ah, sorry Chesnay, I missed it on the 1.3 branch :/



[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal

2017-06-25 Thread Chesnay Schepler (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062297#comment-16062297
 ] 

Chesnay Schepler commented on FLINK-6742:
-

1.3: 2bbfe0292c13d875b531a6c168ea78bfc7f21f0b
1.4: 72b0ae069f8404a2f8a952e1a20004b9d340c445



[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal

2017-06-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062282#comment-16062282
 ] 

ASF GitHub Bot commented on FLINK-6742:
---

Github user gyfora commented on a diff in the pull request:

https://github.com/apache/flink/pull/4083#discussion_r123895607
  
--- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/savepoint/SavepointV2.java ---
@@ -168,10 +168,27 @@ public static Savepoint convertToOperatorStateSavepointV2(
                 expandedToLegacyIds = true;
             }
 
+            if (jobVertex == null) {
+                throw new IllegalStateException(
+                    "Could not find task for state with ID " + taskState.getJobVertexID() + ". " +
+                    "When migrating a savepoint from a version < 1.3 please make sure that the topology was not " +
+                    "changed through removal of a stateful operator or modification of a chain containing a stateful " +
+                    "operator.");
+            }
+
             List operatorIDs = jobVertex.getOperatorIDs();
 
             for (int subtaskIndex = 0; subtaskIndex < jobVertex.getParallelism(); subtaskIndex++) {
-                SubtaskState subtaskState = taskState.getState(subtaskIndex);
+                SubtaskState subtaskState;
+                try {
+                    subtaskState = taskState.getState(subtaskIndex);
--- End diff --

Sorry for commenting late on this, but I have had some major migration 
issues in the last few days :D 
I think we should explicitly compare the parallelism instead of relying on 
the error:

    if (taskState.getStates().size() != jobVertex.getParallelism()) --> error

Otherwise this will not fail when the new parallelism is lower, because the 
loop only asks for subtask indexes below jobVertex.getParallelism().
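
To illustrate the point, here is a minimal sketch that contrasts the two approaches. It does not use Flink classes: a plain `List<String>` stands in for the per-subtask states stored in the savepoint, and both method names are hypothetical. It also assumes that looking up an out-of-range subtask index throws, which is what the `try` block in the diff suggests.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: a plain list stands in for the saved subtask states. */
class ParallelismCheckSketch {

    /** Lazy variant: a mismatch is only noticed when an index lookup fails. */
    static boolean lazyCheckFails(List<String> savedSubtaskStates, int newParallelism) {
        try {
            for (int subtaskIndex = 0; subtaskIndex < newParallelism; subtaskIndex++) {
                savedSubtaskStates.get(subtaskIndex); // throws if the index is out of range
            }
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    /** Eager variant: compares the two sizes before touching any state. */
    static boolean eagerCheckFails(List<String> savedSubtaskStates, int newParallelism) {
        return savedSubtaskStates.size() != newParallelism;
    }

    public static void main(String[] args) {
        // Savepoint taken with parallelism 4.
        List<String> saved = Arrays.asList("s0", "s1", "s2", "s3");

        System.out.println("higher, lazy:  " + lazyCheckFails(saved, 6));
        System.out.println("lower, lazy:   " + lazyCheckFails(saved, 2));
        System.out.println("lower, eager:  " + eagerCheckFails(saved, 2));
    }
}
```

With a lower parallelism the lazy variant never requests an index beyond the saved states, so it reports no problem and the remaining states would be dropped silently; the explicit size comparison catches both directions.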




[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal

2017-06-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062229#comment-16062229
 ] 

ASF GitHub Bot commented on FLINK-6742:
---

Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/4083




[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal

2017-06-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16042403#comment-16042403
 ] 

ASF GitHub Bot commented on FLINK-6742:
---

Github user rmetzger commented on the issue:

https://github.com/apache/flink/pull/4083
  
Change looks good to merge!




[jira] [Commented] (FLINK-6742) Improve error message when savepoint migration fails due to task removal

2017-06-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16040645#comment-16040645
 ] 

ASF GitHub Bot commented on FLINK-6742:
---

GitHub user zentol opened a pull request:

https://github.com/apache/flink/pull/4083

[FLINK-6742] Improve savepoint migration failure error message

This PR improves the error messages shown when the savepoint migration fails 
because a stateful task was removed or the parallelism of a stateful operator 
was changed.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zentol/flink 6742

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/4083.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4083


commit 6f701d17e7eb62a21f5c9466ba9acf8696ec9ab8
Author: zentol 
Date:   2017-06-07T10:03:21Z

[FLINK-6742] Improve savepoint migration failure error message

commit 38b07a7c4654a84b4370ed948be3ab76c28afad5
Author: zentol 
Date:   2017-06-07T10:03:57Z

[hotfix] Improve readability in SPV2#convertToOperatorStateSavepointV2



