[jira] [Commented] (FLINK-8976) End-to-end test: Resume with different parallelism
[ https://issues.apache.org/jira/browse/FLINK-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16413429#comment-16413429 ]

ASF GitHub Bot commented on FLINK-8976:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/5745

> End-to-end test: Resume with different parallelism
> --------------------------------------------------
>
> Key: FLINK-8976
> URL: https://issues.apache.org/jira/browse/FLINK-8976
> Project: Flink
> Issue Type: Sub-task
> Components: Tests
> Affects Versions: 1.5.0
> Reporter: Till Rohrmann
> Assignee: Tzu-Li (Gordon) Tai
> Priority: Blocker
> Fix For: 1.5.0
>
> Similar to FLINK-8975, we should have an end-to-end test which resumes a job
> with a different parallelism after taking
> a) a savepoint
> b) from the last retained checkpoint

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (FLINK-8976) End-to-end test: Resume with different parallelism
[ https://issues.apache.org/jira/browse/FLINK-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411091#comment-16411091 ]

ASF GitHub Bot commented on FLINK-8976:
---------------------------------------

Github user zentol commented on the issue:

    https://github.com/apache/flink/pull/5745

    We may improve the test times by writing a custom reporter specifically for this test that writes to the log if the condition is met.
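If such a reporter wrote a marker line to the log, the test script could simply wait for that line instead of parsing periodic metric output. The following is a minimal sketch of the script side only, under stated assumptions: the function name `wait_for_log_line`, the log file, and the pattern are all hypothetical and do not appear in the PR.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PR): poll a log file until a line
# matching the given pattern shows up, or give up after a timeout.
#   $1 = log file, $2 = grep pattern, $3 = timeout in seconds (default 60)
wait_for_log_line() (
  log_file=$1
  pattern=$2
  timeout_secs=${3:-60}
  waited=0
  # grep -q exits 0 as soon as one matching line exists
  until grep -q -- "$pattern" "$log_file" 2>/dev/null; do
    if [ "$waited" -ge "$timeout_secs" ]; then
      return 1   # pattern never appeared within the timeout
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
  return 0
)
```

A caller would invoke it as `wait_for_log_line "$log" "some marker" 120 || exit 1`, where the marker text is whatever the custom reporter is written to emit.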
[jira] [Commented] (FLINK-8976) End-to-end test: Resume with different parallelism
[ https://issues.apache.org/jira/browse/FLINK-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411075#comment-16411075 ]

ASF GitHub Bot commented on FLINK-8976:
---------------------------------------

Github user tzulitai commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5745#discussion_r176678588

    --- Diff: flink-end-to-end-tests/test-scripts/test_resume_savepoint.sh ---
    @@ -17,11 +17,25 @@
     # limitations under the License.
    +if [ -z $1 ] || [ -z $2 ]; then
    +  echo "Usage: ./test_resume_savepoint.sh "
    +  exit 1
    +fi
    +
     source "$(dirname "$0")"/common.sh
    -# modify configuration to have 2 slots
    +ORIGINAL_DOP=$1
    +NEW_DOP=$2
    +
    +if (( $ORIGINAL_DOP >= $NEW_DOP )); then
    +  NUM_SLOTS=$(( $ORIGINAL_DOP + 1 ))
    --- End diff --

    Yes, it is for the Kafka event generator job.
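The slot computation under review can be sketched as a small function: take the larger of the two parallelisms and add one slot for the Kafka event generator job, as confirmed in the reply above. The function name `compute_num_slots` is hypothetical and not part of the PR; the logic mirrors the if/else in the diff.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PR): number of task slots needed
# to run the state machine job before and after the restore.
#   $1 = original parallelism, $2 = new parallelism
compute_num_slots() (
  original_dop=$1
  new_dop=$2
  # the job must fit at whichever parallelism is larger
  if [ "$original_dop" -ge "$new_dop" ]; then
    max_dop=$original_dop
  else
    max_dop=$new_dop
  fi
  # +1: one extra slot for the Kafka event generator job
  echo $(( max_dop + 1 ))
)

compute_num_slots 2 4   # scale up:   prints 5
compute_num_slots 4 2   # scale down: prints 5
```

Both directions need the same number of slots, because the cluster configuration is written once, before the first run.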
[jira] [Commented] (FLINK-8976) End-to-end test: Resume with different parallelism
[ https://issues.apache.org/jira/browse/FLINK-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411078#comment-16411078 ]

ASF GitHub Bot commented on FLINK-8976:
---------------------------------------

Github user tzulitai commented on the issue:

    https://github.com/apache/flink/pull/5745

    Thanks @zentol.
    I'll merge this (as well as #5733) with your comments addressed, as soon as Travis shows green.
[jira] [Commented] (FLINK-8976) End-to-end test: Resume with different parallelism
[ https://issues.apache.org/jira/browse/FLINK-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411074#comment-16411074 ]

ASF GitHub Bot commented on FLINK-8976:
---------------------------------------

Github user tzulitai commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5745#discussion_r176678540

    --- Diff: flink-end-to-end-tests/test-scripts/test_resume_savepoint.sh ---
    @@ -17,11 +17,25 @@
     # limitations under the License.
    +if [ -z $1 ] || [ -z $2 ]; then
    +  echo "Usage: ./test_resume_savepoint.sh "
    +  exit 1
    +fi
    +
     source "$(dirname "$0")"/common.sh
    -# modify configuration to have 2 slots
    +ORIGINAL_DOP=$1
    +NEW_DOP=$2
    +
    +if (( $ORIGINAL_DOP >= $NEW_DOP )); then
    +  NUM_SLOTS=$(( $ORIGINAL_DOP + 1 ))
    +else
    +  NUM_SLOTS=$(( $NEW_DOP + 1 ))
    +fi
    +
    +# modify configuration to have enough slots
     cp $FLINK_DIR/conf/flink-conf.yaml $FLINK_DIR/conf/flink-conf.yaml.bak
    -sed -i -e 's/taskmanager.numberOfTaskSlots: 1/taskmanager.numberOfTaskSlots: 2/' $FLINK_DIR/conf/flink-conf.yaml
    +sed -i -e "s/taskmanager.numberOfTaskSlots: 1/taskmanager.numberOfTaskSlots: $NUM_SLOTS/" $FLINK_DIR/conf/flink-conf.yaml
     # modify configuration to use SLF4J reporter; we will be using this to monitor the state machine progress
     cp $FLINK_DIR/opt/flink-metrics-slf4j-1.6-SNAPSHOT.jar $FLINK_DIR/lib/
    --- End diff --

    That makes sense, will fix this.
[jira] [Commented] (FLINK-8976) End-to-end test: Resume with different parallelism
[ https://issues.apache.org/jira/browse/FLINK-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411068#comment-16411068 ]

ASF GitHub Bot commented on FLINK-8976:
---------------------------------------

Github user zentol commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5745#discussion_r176677065

    --- Diff: flink-end-to-end-tests/test-scripts/test_resume_savepoint.sh ---
    @@ -17,11 +17,25 @@
     # limitations under the License.
    +if [ -z $1 ] || [ -z $2 ]; then
    +  echo "Usage: ./test_resume_savepoint.sh "
    +  exit 1
    +fi
    +
     source "$(dirname "$0")"/common.sh
    -# modify configuration to have 2 slots
    +ORIGINAL_DOP=$1
    +NEW_DOP=$2
    +
    +if (( $ORIGINAL_DOP >= $NEW_DOP )); then
    +  NUM_SLOTS=$(( $ORIGINAL_DOP + 1 ))
    --- End diff --

    It would be good to explain where the +1 comes from (I guess it is for the Kafka job).
[jira] [Commented] (FLINK-8976) End-to-end test: Resume with different parallelism
[ https://issues.apache.org/jira/browse/FLINK-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411067#comment-16411067 ]

ASF GitHub Bot commented on FLINK-8976:
---------------------------------------

Github user zentol commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5745#discussion_r176677259

    --- Diff: flink-end-to-end-tests/test-scripts/test_resume_savepoint.sh ---
    @@ -17,11 +17,25 @@
     # limitations under the License.
    +if [ -z $1 ] || [ -z $2 ]; then
    +  echo "Usage: ./test_resume_savepoint.sh "
    +  exit 1
    +fi
    +
     source "$(dirname "$0")"/common.sh
    -# modify configuration to have 2 slots
    +ORIGINAL_DOP=$1
    +NEW_DOP=$2
    +
    +if (( $ORIGINAL_DOP >= $NEW_DOP )); then
    +  NUM_SLOTS=$(( $ORIGINAL_DOP + 1 ))
    +else
    +  NUM_SLOTS=$(( $NEW_DOP + 1 ))
    +fi
    +
    +# modify configuration to have enough slots
     cp $FLINK_DIR/conf/flink-conf.yaml $FLINK_DIR/conf/flink-conf.yaml.bak
    -sed -i -e 's/taskmanager.numberOfTaskSlots: 1/taskmanager.numberOfTaskSlots: 2/' $FLINK_DIR/conf/flink-conf.yaml
    +sed -i -e "s/taskmanager.numberOfTaskSlots: 1/taskmanager.numberOfTaskSlots: $NUM_SLOTS/" $FLINK_DIR/conf/flink-conf.yaml
     # modify configuration to use SLF4J reporter; we will be using this to monitor the state machine progress
     cp $FLINK_DIR/opt/flink-metrics-slf4j-1.6-SNAPSHOT.jar $FLINK_DIR/lib/
    --- End diff --

    Can we define the version as an environment variable in common.sh?
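One way this suggestion could look is sketched below. It is an illustration only: `FLINK_VERSION` and `slf4j_reporter_jar` are assumed names, and the thread does not show what was eventually committed. The idea is that the version string lives in one place and the jar path is derived from it.

```shell
#!/bin/sh
# Sketch only: FLINK_VERSION and slf4j_reporter_jar are hypothetical names.

# Would live in common.sh: a single place to bump the version.
# The default is the version that is hardcoded in the diff above.
FLINK_VERSION="${FLINK_VERSION:-1.6-SNAPSHOT}"
export FLINK_VERSION

# Builds the SLF4J reporter jar path from FLINK_DIR and FLINK_VERSION.
slf4j_reporter_jar() {
  echo "$FLINK_DIR/opt/flink-metrics-slf4j-$FLINK_VERSION.jar"
}

# In the test script, the hardcoded copy would then become:
#   cp "$(slf4j_reporter_jar)" "$FLINK_DIR/lib/"
```

With this layout a version bump touches only the one assignment, and every test script that sources the common file picks it up.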
[jira] [Commented] (FLINK-8976) End-to-end test: Resume with different parallelism
[ https://issues.apache.org/jira/browse/FLINK-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16409264#comment-16409264 ]

ASF GitHub Bot commented on FLINK-8976:
---------------------------------------

GitHub user tzulitai opened a pull request:

    https://github.com/apache/flink/pull/5745

    [FLINK-8976] [test] Add end-to-end test for resuming savepoints with different parallelism

    ## What is the purpose of the change

    This PR adds end-to-end tests for resuming a savepoint with different parallelisms.

    ## Brief change log

    The changes are based on the new `test_resume_savepoint.sh` test script in #5733, so only the last commit is relevant.
    That script has been modified so that the parallelism of the state machine job can be specified before and after the savepoint restore.

    - Adapt `test_resume_savepoint.sh` to be able to use different parallelism after the savepoint restore.
    - Add scale up / scale down tests to be executed by Travis.

    ## Verifying this change

    This PR adds new tests.

    ## Does this pull request potentially affect one of the following parts:

    - Dependencies (does it add or upgrade a dependency): (yes / **no**)
    - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**)
    - The serializers: (yes / **no** / don't know)
    - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know)
    - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know)
    - The S3 file system connector: (yes / **no** / don't know)

    ## Documentation

    - Does this pull request introduce a new feature? (yes / **no**)
    - If yes, how is the feature documented? (**not applicable** / docs / JavaDocs / not documented)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tzulitai/flink FLINK-8976

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5745.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5745

commit 529e060cb05fd723b8656dcc9ef48f8011282dd8
Author: Tzu-Li (Gordon) Tai
Date: 2018-03-21T08:25:37Z

    [FLINK-8975] [test] Add Kafka events generator job for StateMachineExample

commit 6b5126c006c752c6c1bee2699d429927e74587c9
Author: Tzu-Li (Gordon) Tai
Date: 2018-03-21T08:32:51Z

    [FLINK-8975] [test] Add resume from savepoint end-to-end test

commit 475ef4de0f8abbd3bb485e0e6fefcce2612074d3
Author: Tzu-Li (Gordon) Tai
Date: 2018-03-22T07:34:49Z

    fixup! Use SLF4J reporter to monitor state machine progress

commit 55d69567bc4cb61cd08ad852126c9c2c39b8feb4
Author: Tzu-Li (Gordon) Tai
Date: 2018-03-22T09:09:53Z

    [FLINK-8976] [test] Add end-to-end tests for resuming savepoint with different parallelism
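Since the script now takes the two parallelisms as positional arguments, a stricter argument check than the one in the PR could look like the sketch below. The function name `validate_dop_args` and its messages are hypothetical; the PR itself only tests whether `$1` and `$2` are empty.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PR): accept exactly two arguments,
# each a non-empty, all-digit, non-zero parallelism.
validate_dop_args() (
  if [ "$#" -ne 2 ]; then
    echo "expected two arguments: original and new parallelism" >&2
    return 1
  fi
  for arg in "$@"; do
    case $arg in
      # empty string, anything containing a non-digit, or a plain 0
      ''|*[!0-9]*|0)
        echo "parallelism must be a positive integer, got: '$arg'" >&2
        return 1
        ;;
    esac
  done
  return 0
)
```

A script would call `validate_dop_args "$@" || exit 1` before reading `$1` and `$2`; a scale-up run would then pass for example `2 4`, and a scale-down run `4 2`.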
[jira] [Commented] (FLINK-8976) End-to-end test: Resume with different parallelism
[ https://issues.apache.org/jira/browse/FLINK-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405096#comment-16405096 ]

Till Rohrmann commented on FLINK-8976:
--------------------------------------

My oversight. It should also work with incremental checkpoints. At least it worked for me when trying it out. I will correct the description.

> End-to-end test: Resume with different parallelism
> --------------------------------------------------
>
> Key: FLINK-8976
> URL: https://issues.apache.org/jira/browse/FLINK-8976
> Project: Flink
> Issue Type: Sub-task
> Components: Tests
> Affects Versions: 1.5.0
> Reporter: Till Rohrmann
> Assignee: Tzu-Li (Gordon) Tai
> Priority: Blocker
> Fix For: 1.5.0
>
> Similar to FLINK-8975, we should have an end-to-end test which resumes a job
> with a different parallelism after taking
> a) a savepoint
> b) from the last retained checkpoint (this won't work with RocksDB
> incremental checkpoints)
[jira] [Commented] (FLINK-8976) End-to-end test: Resume with different parallelism
[ https://issues.apache.org/jira/browse/FLINK-8976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402023#comment-16402023 ]

Sihua Zhou commented on FLINK-8976:
-----------------------------------

Hi [~till.rohrmann], could I ask why this issue won't work with RocksDB incremental checkpoints?