[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093108#comment-17093108 ]

Robert Metzger commented on FLINK-16770:

[~SleePy] thanks again for the analysis of the log output. I have filed a new ticket for the failure, as it is quite likely that this issue is not related to the broken checkpoint coordinator, but rather to a problem with the test scripts or the TaskManagers. Let's track the analysis of this failure pattern in FLINK-17404.

> Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
> ---
>
> Key: FLINK-16770
> URL: https://issues.apache.org/jira/browse/FLINK-16770
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Tests
> Affects Versions: 1.11.0
> Reporter: Zhijiang
> Assignee: Yun Tang
> Priority: Blocker
> Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
> Attachments: e2e-output.log, flink-vsts-standalonesession-0-fv-az53.log, image-2020-04-16-11-24-54-549.png
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> The log :
> [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6603=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]
>
> There was also the similar problem in https://issues.apache.org/jira/browse/FLINK-16561, but for the case of no parallelism change. And this case is for scaling up. Not quite sure whether the root cause is the same one.
> {code:java}
> 2020-03-25T06:50:31.3894841Z Running 'Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test'
> 2020-03-25T06:50:31.3895308Z ==
> 2020-03-25T06:50:31.3907274Z TEST_DATA_DIR: /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-31390197304
> 2020-03-25T06:50:31.5500274Z Flink dist directory: /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-25T06:50:31.6354639Z Starting cluster.
> 2020-03-25T06:50:31.8871932Z Starting standalonesession daemon on host fv-az655.
> 2020-03-25T06:50:33.5021784Z Starting taskexecutor daemon on host fv-az655.
> 2020-03-25T06:50:33.5152274Z Waiting for Dispatcher REST endpoint to come up...
> 2020-03-25T06:50:34.5498116Z Waiting for Dispatcher REST endpoint to come up...
> 2020-03-25T06:50:35.6031346Z Waiting for Dispatcher REST endpoint to come up...
> 2020-03-25T06:50:36.9848425Z Waiting for Dispatcher REST endpoint to come up...
> 2020-03-25T06:50:38.0283377Z Dispatcher REST endpoint is up.
> 2020-03-25T06:50:38.0285490Z Running externalized checkpoints test, with ORIGINAL_DOP=2 NEW_DOP=4 and STATE_BACKEND_TYPE=rocks STATE_BACKEND_FILE_ASYNC=true STATE_BACKEND_ROCKSDB_INCREMENTAL=true SIMULATE_FAILURE=false ...
> 2020-03-25T06:50:46.1754645Z Job (b8cb04e4b1e730585bc616aa352866d0) is running.
> 2020-03-25T06:50:46.1758132Z Waiting for job (b8cb04e4b1e730585bc616aa352866d0) to have at least 1 completed checkpoints ...
> 2020-03-25T06:50:46.3478276Z Waiting for job to process up to 200 records, current progress: 173 records ...
> 2020-03-25T06:50:49.6332988Z Cancelling job b8cb04e4b1e730585bc616aa352866d0.
> 2020-03-25T06:50:50.4875673Z Cancelled job b8cb04e4b1e730585bc616aa352866d0.
> 2020-03-25T06:50:50.5468230Z ls: cannot access '/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-31390197304/externalized-chckpt-e2e-backend-dir/b8cb04e4b1e730585bc616aa352866d0/chk-[1-9]*/_metadata': No such file or directory
> 2020-03-25T06:50:50.5606260Z Restoring job with externalized checkpoint at . ...
> 2020-03-25T06:50:58.4728245Z
> 2020-03-25T06:50:58.4732663Z
> 2020-03-25T06:50:58.4735785Z The program finished with the following exception:
> 2020-03-25T06:50:58.4737759Z
> 2020-03-25T06:50:58.4742666Z org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: java.util.concurrent.ExecutionException: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
> 2020-03-25T06:50:58.4746274Z at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
> 2020-03-25T06:50:58.4749954Z at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205)
> 2020-03-25T06:50:58.4752753Z at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:142)
> 2020-03-25T06:50:58.4755400Z at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:659)
> 2020-03-25T06:50:58.4757862Z at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:210)
> 2020-03-25T06:50:58.4760282Z at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:890)
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093087#comment-17093087 ]

Robert Metzger commented on FLINK-16770:

No, there is no way to get the logs. The log-upload functionality is never executed, because the script hangs forever in the error case. The VM that executed the test has been destroyed. We need to merge my PR, wait for the issue to appear again, and then look at the logs.
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092098#comment-17092098 ]

Biao Liu commented on FLINK-16770:

Technically speaking, the scenario we discussed here should not happen with the reverted code. The finalization of the checkpoint is back to being executed synchronously, wrapped in the coordinator-wide lock, so there shouldn't be a race condition at all. On the other hand, the earlier commits of the refactoring were merged over 3 months ago. So to answer the question of [~pnowojski], I think we have reverted enough commits.

I have noticed these lines in the log:
{quote}kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]
Killed TM @
{quote}
It seems that at some point there was no TM process. I guess that is not a normal scenario. The {{ha_tm_watchdog}} in common_ha.sh should start a new TM before killing an old one in this case. What if there is no TM process at all, because it exited or was killed unexpectedly? I'm not sure, but I think there would not be enough TMs to finish the test case, because the {{ha_tm_watchdog}} only starts a new TM if there are enough TMs:
{quote}local MISSING_TMS=$((EXPECTED_TMS-RUNNING_TMS))
if [ ${MISSING_TMS} -eq 0 ]; then
    # start a new TM only if we have exactly the expected number
    "$FLINK_DIR"/bin/taskmanager.sh start > /dev/null
fi{quote}
I guess the failure has a different cause, maybe related to the "no TM process". But I can't tell what really happened in this case without any other logs. Is there any way we could find the JM logs? [~rmetzger]
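To illustrate the concern about the watchdog, here is a small Java model of the start condition in the {{ha_tm_watchdog}} snippet quoted in the comment. It is only a sketch: `TmWatchdogSketch` and its methods are hypothetical names, and it models nothing but the `MISSING_TMS` check, not the rest of common_ha.sh.

```java
/** Hypothetical model of the start condition in ha_tm_watchdog; not part of Flink. */
final class TmWatchdogSketch {

    /** Mirrors MISSING_TMS=$((EXPECTED_TMS-RUNNING_TMS)). */
    static int missingTms(int expectedTms, int runningTms) {
        return expectedTms - runningTms;
    }

    /** Mirrors the quoted check: a new TM is started only when none is missing. */
    static boolean startsNewTm(int expectedTms, int runningTms) {
        return missingTms(expectedTms, runningTms) == 0;
    }

    public static void main(String[] args) {
        int expected = 1; // assumed value, chosen only for the demonstration
        for (int running = 0; running <= expected; running++) {
            System.out.println(
                "running=" + running + " startsNewTm=" + startsNewTm(expected, running));
        }
    }
}
```

In this model the condition is false whenever a TM is already missing, which matches the worry in the comment: if the only TM disappears unexpectedly, this branch never replaces it.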
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091604#comment-17091604 ]

Piotr Nowojski commented on FLINK-16770:

Your PR definitely includes the reverts, so there is something to investigate here. Either there is another error with similar symptoms, or we haven't reverted enough commits. [~SleePy] [~yunta] what do you think has happened?

For one thing, I forgot to mention in this ticket that I've reverted the commits. We also need to clean up the tickets for this issue. I wanted to close this bug, but we were still discussing solutions here. I guess that was a mistake: after reverting the commits and re-opening the original issue, we should have migrated the discussion there. So let's keep this ticket open for the investigation of your most recent report [~rmetzger].
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091554#comment-17091554 ]

Robert Metzger commented on FLINK-16770:

I didn't realize that you had reverted the commits potentially causing this. The run is based on this commit: https://github.com/apache/flink/commit/008e0afb3c62c059dcdf2c58a43cdd2e2d283512 (this is the run: https://travis-ci.org/github/apache/flink/jobs/678609505).

Can somebody review and approve this PR? https://github.com/apache/flink/pull/11831
It would help us debug this e2e test in the future. In current master, it is impossible to debug the e2e test.
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091543#comment-17091543 ]

Piotr Nowojski commented on FLINK-16770:

[~rmetzger] how could this happen? I reverted the cause of this bug over a week ago: https://issues.apache.org/jira/browse/FLINK-14971

Or do we have to revert even more code [~SleePy]?
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091377#comment-17091377 ]

Robert Metzger commented on FLINK-16770:

Another case: https://api.travis-ci.org/v3/job/678609505/log.txt
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084901#comment-17084901 ]

Stephan Ewen commented on FLINK-16770:
--------------------------------------

Thank you all for the great discussion and analysis. I would like to add a few points and suggestions, based on how I understand the problem:

h2. There are two main issues:

*(1) Missing ownership in the multi-threaded system. Meaning: Who owns the "Pending Checkpoint during Finalization"?*
 - It is owned by the CheckpointCoordinator (which aborts it when shutting down).
 - It is also owned by the I/O thread or the CompletedCheckpointStore, which writes it to ZooKeeper (or a similar system).

*(2) No Shared Ground Truth between the Checkpoint Coordinator and the JobMaster*
 - When a checkpoint is finalized, that decision is not consistently visible to the JM.
 - The JM only sees the result once it is in ZK, which is an asynchronous operation.
 - That causes the final issue described here: the JM may start from an earlier checkpoint if a restart happens while the asynchronous write to ZK is still in progress.
 - NOTE: It is fine to ignore a checkpoint that was completed, if we did not send "notification complete" and we are sure it will always be ignored. That would be as if the checkpoint never completed.
 - NOTE: It is not fine to ignore it and start from an earlier checkpoint if it will get committed later. That is the bug to prevent.

h2. Two steps to a cleaner solution

*(1) When the checkpoint is ready (all tasks acked, metadata written out), the Checkpoint Coordinator transfers ownership to the CompletedCheckpointStore.*
 - That means the checkpoint is removed from the "Pending Checkpoints" map and added to the CompletedCheckpointStore in one call in the main thread. If this is in one call, it is atomic against other modifications (cancellation, disposing checkpoints).
 - Because the checkpoint is removed from the "Pending Checkpoints" map (not owned by the coordinator any more), it will not get cancelled during shutdown of the coordinator.

==> This is a very simple change.

*(2) The addition to the CompletedCheckpointStore must be constant time and executed in the main thread*
 - That means that the CompletedCheckpointStore would put the completed checkpoint into a local list and then kick off the asynchronous request to add it to ZK.
 - If the JM looks up the latest checkpoint, it refers to that local list. That way all local components refer to the same status and do not exchange status asynchronously via an external system (ZK).

==> The change is that the CompletedCheckpointStore would not always repopulate itself from ZK upon "restore checkpoint", but keep the local state and only repopulate itself when the master gains leader status (and clear itself when leader status is lost).
==> This is a slightly more complex change, but not too big.

h2. Distributed Races and Corner Cases

I think this is an existing corner case issue, not related to this bug, but I list it here for consistency. It exists because JM failover can happen concurrently with ZK updates.
 * Once the call to add the checkpoint to ZK is sent off, the checkpoint might or might not get added to ZK (which is the distributed ground truth).
 * During that time, we cannot restore at all.
 ** If the JM already restored from the checkpoint, it sends "restore state" to the tasks, which is equivalent to "notify checkpoint complete" and materializes external side effects. If the addition to ZK then fails and the JM fails and another JM becomes leader, it will restore from an earlier checkpoint.
 ** If the JM restores from an earlier checkpoint during that time, and then the ZK call completes, we have duplicate side effects.
 * In both cases we get fractured consistency or duplicate side effects.

I see two possible solutions, neither of them easy:

*(a) We cannot restore during the period where the checkpoint is in "uncertain if committed" state*
 * The CompletedCheckpointStore would need to keep the checkpoint in an "uncertain" list initially, until the I/O executor call returns from adding the checkpoint to ZK.
 * When asking the CompletedCheckpointStore for the latest checkpoint, it returns a CompletableFuture.
 * While the latest checkpoint is in that list, the future cannot be completed. It completes when the ZK command completes (usually a few hundred milliseconds). Restore operations would need to wait during that time.
 * There is a separate issue, FLINK-16931, where "loading metadata" for the latest completed checkpoint can take long (seconds), because it is an I/O operation. This sounds like a similar issue, but I fear that the solution is more complex than anticipated in that issue.

*(b) Change the contracts with operators that side-effects are never committed during restore.*
 * Then it is safe to restore already from the operator that is not yet in ZK, because the
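
To make the two steps under "Two steps to a cleaner solution" more concrete, here is a minimal Java sketch. It is an illustration under stated assumptions, not Flink code: `Checkpoint`, `LocalFirstStore` and `Coordinator` are hypothetical stand-ins for the pending checkpoint, the CompletedCheckpointStore and the CheckpointCoordinator, and the `externalWrite` runnable stands in for the ZK update. All methods are assumed to be called from a single main thread, so the collections are not synchronized.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

/** Hypothetical sketch; none of these types are Flink's real classes. */
final class OwnershipHandoverSketch {

    /** Stand-in for a finalized checkpoint; only carries its ID. */
    static final class Checkpoint {
        final long id;

        Checkpoint(long id) {
            this.id = id;
        }
    }

    /** Store that answers lookups from a local list and persists asynchronously. */
    static final class LocalFirstStore {
        private final Deque<Checkpoint> local = new ArrayDeque<>();
        private final Executor ioExecutor;

        LocalFirstStore(Executor ioExecutor) {
            this.ioExecutor = ioExecutor;
        }

        /** Constant-time add on the main thread; the external write runs on ioExecutor. */
        CompletableFuture<Void> add(Checkpoint checkpoint, Runnable externalWrite) {
            local.addLast(checkpoint);
            return CompletableFuture.runAsync(externalWrite, ioExecutor);
        }

        /** Lookups only consult the local list, never the external system. */
        Optional<Checkpoint> latest() {
            return Optional.ofNullable(local.peekLast());
        }
    }

    /** Coordinator that owns checkpoints only while they are pending. */
    static final class Coordinator {
        private final Map<Long, Checkpoint> pending = new HashMap<>();
        private final LocalFirstStore store;

        Coordinator(LocalFirstStore store) {
            this.store = store;
        }

        void startCheckpoint(long id) {
            pending.put(id, new Checkpoint(id));
        }

        /** Step (1): remove from pending and hand over to the store in one call. */
        CompletableFuture<Void> complete(long id, Runnable externalWrite) {
            Checkpoint checkpoint = pending.remove(id);
            if (checkpoint == null) {
                throw new IllegalStateException("Checkpoint " + id + " is not pending");
            }
            return store.add(checkpoint, externalWrite);
        }

        /** Shutdown aborts only what the coordinator still owns; returns the count. */
        int abortPending() {
            int aborted = pending.size();
            pending.clear();
            return aborted;
        }
    }
}
```

The sketch deliberately leaves out the "uncertain if committed" window from the corner-case discussion: `latest()` returns the locally added checkpoint even before the returned future completes, which is exactly the period that options (a) and (b) are meant to address.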
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084514#comment-17084514 ]

Yun Tang commented on FLINK-16770:
----------------------------------

A progress update:

After discussing offline with [~pnowojski] and [~SleePy], we agreed that the idea "a checkpoint that is being finalized must NOT be aborted when {{CheckpointCoordinator}} is being shut down or the periodic scheduler is being stopped", which would lead to "a checkpoint can complete after the job is no longer in RUNNING status", would violate EXACTLY_ONCE semantics.

Explanation: suppose chk-3 is already completed and chk-4 is in the async phase that creates _metadata in HDFS. The job switches to FAILING and restarts from chk-3 in the checkpoint store. After that, the async thread adds chk-4 to the checkpoint store. When control returns to the main thread, we treat chk-4 as completed and notify all tasks of it. If those tasks use two-phase commit, the data between chk-3 and chk-4 is processed twice.

Below is the current logic after FLINK-14971:
!image-2020-04-16-11-24-54-549.png!

If we still want all checkpoint threads to be 100% non-blocking, we need to separate adding a new checkpoint from subsuming old checkpoints, so that an addition can be reverted without deleting old checkpoints by mistake. However, this would be a large change and would need careful review with a design doc.

Since the Flink 1.11 feature freeze is close, we might look at another, much simpler solution. This idea would "revert" some of the logic introduced in FLINK-14971 and make some actions blocking again, for example by sharing a lock between aborting and completing a pending checkpoint, as Flink 1.10 did. This can be seen as a compromise: not 100% non-blocking, but it satisfies EXACTLY_ONCE semantics on top of the previous commits in FLINK-14971.

We currently prefer the latter, which can be done in a short time, and have asked [~SleePy] for help, as he is more familiar with the commits introduced in FLINK-14971.

> Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
> ---
>
> Key: FLINK-16770
> URL: https://issues.apache.org/jira/browse/FLINK-16770
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Tests
> Affects Versions: 1.11.0
> Reporter: Zhijiang
> Assignee: Yun Tang
> Priority: Blocker
> Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
> Attachments: e2e-output.log, flink-vsts-standalonesession-0-fv-az53.log, image-2020-04-16-11-24-54-549.png
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
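
The simpler option mentioned in this comment, sharing a lock between aborting and completing a pending checkpoint, can be sketched roughly as follows. This is a minimal illustration, not the actual CheckpointCoordinator code and not the change made for this ticket: the class `SharedLockSketch`, its `State` enum and the in-memory `completedStore` list are all hypothetical. The point it shows is that whichever of `complete()` or `abort()` takes the lock first wins, so a checkpoint can no longer be added to the store after it has been aborted.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of one pending checkpoint guarded by a shared lock. */
final class SharedLockSketch {

    enum State { PENDING, COMPLETED, ABORTED }

    private final Object lock = new Object();
    private final List<Long> completedStore = new ArrayList<>();
    private final long checkpointId;
    private State state = State.PENDING;

    SharedLockSketch(long checkpointId) {
        this.checkpointId = checkpointId;
    }

    /** Finalizes under the lock; returns false if the checkpoint was already aborted. */
    boolean complete() {
        synchronized (lock) {
            if (state != State.PENDING) {
                return false;
            }
            // The (potentially blocking) store write happens while holding the lock.
            completedStore.add(checkpointId);
            state = State.COMPLETED;
            return true;
        }
    }

    /** Aborts under the same lock; returns false if completion already won. */
    boolean abort() {
        synchronized (lock) {
            if (state != State.PENDING) {
                return false;
            }
            state = State.ABORTED;
            return true;
        }
    }

    State state() {
        synchronized (lock) {
            return state;
        }
    }

    /** Returns a copy of the store contents for inspection. */
    List<Long> completedStore() {
        synchronized (lock) {
            return new ArrayList<>(completedStore);
        }
    }
}
```

The cost of this design is the one named in the comment: the thread calling `abort()` may block while a concurrent `complete()` is writing to the store.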
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083325#comment-17083325 ]

Biao Liu commented on FLINK-16770:
----------------------------------

Thanks [~rmetzger] for the reminder.
[~yunta] good job, please let me know if you need any help.

> Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
> ---
>
> Key: FLINK-16770
> URL: https://issues.apache.org/jira/browse/FLINK-16770
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Tests
> Affects Versions: 1.11.0
> Reporter: Zhijiang
> Assignee: Yun Tang
> Priority: Blocker
> Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
> Attachments: e2e-output.log, flink-vsts-standalonesession-0-fv-az53.log
>
> Time Spent: 20m
> Remaining Estimate: 0h
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083260#comment-17083260 ]

Yun Tang commented on FLINK-16770:
----------------------------------

[~SleePy] After looking at the logs of [~aljoscha]'s instance, I think this has the same cause.

The job was cancelled and did not know that checkpoint-1 had completed:
{code:bash}
05:25:23,676 [flink-akka.actor.default-dispatcher-2] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering checkpoint 1 @1586841923675 for job e2e741b6dcacdac50afd2830c8a6892d.
05:25:23,824 [Source: Custom Source (1/1)] WARN org.apache.flink.runtime.taskmanager.Task [] - Source: Custom Source (1/1) (73b5d00dc26989a5c2226f19ee14ea94) switched from RUNNING to FAILED.
{code}
But it then recovers from checkpoint-1, which means the checkpoint store already contains checkpoint-1, added by the async I/O thread.
{code:bash}
05:25:23,835 [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job Tumbling Window Test (e2e741b6dcacdac50afd2830c8a6892d) switched from state RESTARTING to RUNNING.
05:25:23,835 [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Restoring job e2e741b6dcacdac50afd2830c8a6892d from latest valid checkpoint: Checkpoint 1 @ 1586841923675 for e2e741b6dcacdac50afd2830c8a6892d.
{code}
BTW, for some reason I have not been receiving notifications from JIRA recently and have to check issues manually from time to time. Sorry for the late reply.

> Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
> ---
>
> Key: FLINK-16770
> URL: https://issues.apache.org/jira/browse/FLINK-16770
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Tests
> Affects Versions: 1.11.0
> Reporter: Zhijiang
> Assignee: Yun Tang
> Priority: Blocker
> Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
> Attachments: e2e-output.log, flink-vsts-standalonesession-0-fv-az53.log
>
> Time Spent: 20m
> Remaining Estimate: 0h
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083214#comment-17083214 ]

Robert Metzger commented on FLINK-16770:
----------------------------------------

You don't need to rely on transfer.sh anymore. We are also storing the logs in Azure Pipelines. Each build has a set of artifacts associated with it.
For the build Aljoscha posted, you need to download the file with the "-tests" suffix: https://dev.azure.com/rmetzger/Flink/_build/results?buildId=7418=artifacts=publishedArtifacts

> Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
> ---
>
> Key: FLINK-16770
> URL: https://issues.apache.org/jira/browse/FLINK-16770
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Tests
> Affects Versions: 1.11.0
> Reporter: Zhijiang
> Assignee: Yun Tang
> Priority: Blocker
> Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
> Attachments: e2e-output.log, flink-vsts-standalonesession-0-fv-az53.log
>
> Time Spent: 20m
> Remaining Estimate: 0h
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17083018#comment-17083018 ]

Biao Liu commented on FLINK-16770:
----------------------------------

[~aljoscha], the upload to transfer.sh failed, so I can't confirm the root cause. It might be the same one.

> Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
> ---
>
> Key: FLINK-16770
> URL: https://issues.apache.org/jira/browse/FLINK-16770
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Tests
> Affects Versions: 1.11.0
> Reporter: Zhijiang
> Assignee: Yun Tang
> Priority: Blocker
> Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
> Attachments: e2e-output.log, flink-vsts-standalonesession-0-fv-az53.log
>
> Time Spent: 20m
> Remaining Estimate: 0h
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082921#comment-17082921 ]

Aljoscha Krettek commented on FLINK-16770:
------------------------------------------

Another occurrence of the general problem: https://dev.azure.com/rmetzger/Flink/_build/results?buildId=7418=logs=5c8e7682-d68f-54d1-16a2-a09310218a49=643a4312-2b8f-5e76-5975-7bbc0942470d=3814

> Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
> ---
>
> Key: FLINK-16770
> URL: https://issues.apache.org/jira/browse/FLINK-16770
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Tests
> Affects Versions: 1.11.0
> Reporter: Zhijiang
> Assignee: Yun Tang
> Priority: Blocker
> Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
> Attachments: e2e-output.log, flink-vsts-standalonesession-0-fv-az53.log
>
> Time Spent: 20m
> Remaining Estimate: 0h
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078067#comment-17078067 ]

Biao Liu commented on FLINK-16770:

To [~rmetzger]: I think FLINK-16423 and this ticket fail in the same scenario. In short, the atomicity of finalizing a checkpoint is broken. I wrote a comment in FLINK-16423.
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078022#comment-17078022 ]

Robert Metzger commented on FLINK-16770:

Is this end-to-end test failing because of FLINK-16423?
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076161#comment-17076161 ]

Biao Liu commented on FLINK-16770:

Thanks [~rmetzger] for manually verifying and merging the PR.
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076080#comment-17076080 ]

Robert Metzger commented on FLINK-16770:

Temporary hotfix merged in dd9f9bf040cb82ed7e18c9fdf7c7e1ca6f43f896. Thank you guys for working on this!
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074561#comment-17074561 ]

Piotr Nowojski commented on FLINK-16770:

Thanks [~yunta] for analysing the issue and [~SleePy] for confirming it.
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074438#comment-17074438 ]

Biao Liu commented on FLINK-16770:

After a short offline discussion with [~yunta], we reached agreement on a possible solution. [~yunta] will continue working on it.

Besides that, we think it is better to quickly fix the failing case first, so that others do not keep suffering from this unstable failure. I have created a PR that tries to resolve the failing case with a workaround. [~yunta], could you take a look and check whether anything is missing?
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073536#comment-17073536 ]

Yun Tang commented on FLINK-16770:

As I said above, I prefer to let {{PendingCheckpoint#dispose}} and {{PendingCheckpoint#finalizeCheckpoint}} share the same variable, so that whichever enters first prevents the other from entering. However, after thinking about it a bit more, some logic would still change. I explain below.

Previously, the job status and checkpoint status looked like this:
{code:bash}
RUNNING --> pending chk-8 --> FAILING (discard pending chk-8) --> try to complete chk-8, but already discarded --> RESTARTING (from chk-7)
{code}
However, after FLINK-14971, things changed:
{code:java}
main thread:      RUNNING --> pending chk-8 --> FAILING (discard pending chk-8) --> RESTARTING (no checkpoint existed)
                                  |
async-IO thread:                  |---> finalizing chk-8 --> chk-8 added, chk-7 subsumed
{code}
If we just introduce a light-weight shared variable, things could look like this:
{code:bash}
main thread:      RUNNING --> pending chk-8 --> FAILING (try to discard pending chk-8) --> RESTARTING
                                  |                      ||
                                  |                share variable
                                  |                      ||
async-IO thread:                  |---> try to finalize chk-8 --> if finalize first enter, chk-8 added and chk-7 subsumed
{code}
We might get this result:
{code:bash}
RUNNING --> pending chk-8 --> FAILING (try to discard pending chk-8) --> completed chk-8 --> RESTARTING
{code}
or
{code:bash}
RUNNING --> pending chk-8 --> FAILING (try to discard pending chk-8) --> RESTARTING --> completed chk-8
{code}
As you can see, the job FAILING, which leads to {{CheckpointCoordinator#stopCheckpointScheduler}}, would not have top priority. A strict sync lock between {{CheckpointCoordinator#stopCheckpointScheduler}} and {{PendingCheckpoint#finalizeCheckpoint}} might not help, because the async IO phase would subsume a checkpoint when adding the completed checkpoint.

Thus, I currently prefer to change the logic of {{CompletedCheckpointStore}}: it would not subsume checkpoints by itself; subsumption would only be executed when the checkpoint coordinator calls it from outside. The new workflow looks like this:
{code:bash}
main thread:      RUNNING --> pending chk-8 --> FAILING (tag pending chk-8 as discarded, cancel async io finalizingFuture) --> RESTARTING from chk-7
                                  |
                                  |
                                  |
async-IO thread:                  |---> try to finalize chk-8 --> if not tagged as discarded, chk-8 added, or canceled to delete chk-8
{code}
or
{code:bash}
main thread:      RUNNING --> pending chk-8 --------------------------------------------------------------------> complete chk-8, subsume chk-7 in store --> FAILING --> RESTARTING from chk-8
                                  |                                                                       |
                                  |                                                                       |
                                  |                                                                       |
async-IO thread:                  |---> try to finalize chk-8 --> if not tagged as discarded, chk-8 added
{code}
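The proposed change to the store can be sketched roughly as follows. This is only an illustration under stated assumptions: {{SketchCheckpointStore}} and all of its methods are hypothetical names, not Flink's {{CompletedCheckpointStore}} API. The point it tries to show is that adding a checkpoint never subsumes older ones as a side effect, so a checkpoint whose finalization was cancelled can be removed again while chk-7 is still available; subsumption happens only when the coordinator asks for it.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch; this is NOT Flink's CompletedCheckpointStore API.
final class SketchCheckpointStore {
    // Completed checkpoint ids, oldest first.
    private final Deque<Long> completed = new ArrayDeque<>();

    // Adds a completed checkpoint. Deliberately does NOT subsume older ones.
    synchronized void add(long checkpointId) {
        completed.addLast(checkpointId);
    }

    // Removes a checkpoint again, e.g. when its finalization was cancelled.
    synchronized boolean remove(long checkpointId) {
        return completed.removeFirstOccurrence(Long.valueOf(checkpointId));
    }

    // Explicit subsumption, intended to be called by the coordinator only.
    // Returns how many older checkpoints were dropped.
    synchronized int subsumeOlderThan(long checkpointId) {
        int removed = 0;
        while (!completed.isEmpty() && completed.peekFirst() < checkpointId) {
            completed.pollFirst();
            removed++;
        }
        return removed;
    }

    // Latest completed checkpoint id, or null if the store is empty.
    synchronized Long latest() {
        return completed.peekLast();
    }

    synchronized int size() {
        return completed.size();
    }
}
```

In the first workflow above, the async thread would call `add(8)` and, on cancellation, `remove(8)`, leaving chk-7 as the restore point. In the second, the main thread would call `subsumeOlderThan(8)` once chk-8 is confirmed complete.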
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073388#comment-17073388 ]

Biao Liu commented on FLINK-16770:
----------------------------------

Hi [~yunta], thanks for the response. If I understand correctly, the state of {{CompletedCheckpointStore}} becomes inconsistent when we stop a checkpoint that is in the middle of asynchronous finalization. There are two strategies here:

1. The checkpoint that is being finalized can be aborted when {{CheckpointCoordinator}} is being shut down or the periodic scheduler is being stopped. This is what the current implementation does. However, we do not handle the {{CompletedCheckpointStore}} well. For example, it might be better to revert the state of {{CompletedCheckpointStore}} when the {{PendingCheckpoint}} finds that it was discarded after asynchronous finalization. But I think that is not easy to do, because there might be a subsuming operation during {{CompletedCheckpointStore#addCheckpoint}}.
2. The checkpoint that is being finalized can NOT be aborted when {{CheckpointCoordinator}} is being shut down or the periodic scheduler is being stopped. I personally prefer this solution, because it simplifies the concurrent conflict scenario and is much easier to implement.

I think introducing an atomic boolean might not be enough. It would be better to rethink the relationship between {{PendingCheckpoint#abort}} and {{PendingCheckpoint#finalizeCheckpoint}}. We also need to rewrite part of the error handling of the finalization.

BTW, [~yunta], could you share the unit test case that reproduces the scenario locally? I want to verify my suggestion and solution. The original e2e test case is not stable.
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073041#comment-17073041 ]

Yun Tang commented on FLINK-16770:
----------------------------------

[~SleePy] Yes, this bug was introduced by FLINK-14971, and I can reproduce it locally with a unit test. There is a race between {{PendingCheckpoint#dispose}} and {{PendingCheckpoint#finalizeCheckpoint}}: the current {{operationLock}} only ensures that the async phase that deletes this pending checkpoint and the adding of the completed checkpoint do not happen at the same time. It cannot ensure that the pending checkpoint is not first added to the checkpoint store and then dropped.

One quick fix would be to add an atomic boolean shared between these two async operations: once the pending checkpoint has been added to the checkpoint store, it would no longer be dropped asynchronously. However, this could lead to something misleading: the pending checkpoint might be added to the checkpoint store successfully in the async phase but tagged as disposed in the main thread. Although we could avoid dropping it in the async phase of {{PendingCheckpoint#dispose}}, the checkpoint coordinator would not treat this pending checkpoint as successful, and it would not be displayed in the checkpoint web UI. But luckily, we could at least ensure that no data is deleted by mistake; the job could still fail over by recovering from the latest completed checkpoint.

Another solution is to compare-and-set an atomic variable in the main thread during {{PendingCheckpoint#dispose}} and to share that variable with the code that adds to the checkpoint store. If we arrive first to add the checkpoint to the store, we would not let the main thread tag that pending checkpoint as discarded. On the other hand, if we arrive first to tag this pending checkpoint as discarded, we would not try to add it to the checkpoint store. I think this could be really light-weight and non-blocking, but it would introduce some extra CAS work in the main thread.

What do you think of this?
[~pnowojski]
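The compare-and-set idea from Yun Tang's comment, where whichever of dispose and finalize arrives first wins, might look roughly like the sketch below. It is a minimal illustration, not Flink code: the class {{PendingCheckpointSketch}}, its {{State}} enum and the methods {{tryFinalize}} and {{tryDiscard}} are names assumed for this example only.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for a pending checkpoint; not Flink's PendingCheckpoint. */
final class PendingCheckpointSketch {

    enum State { PENDING, FINALIZED, DISCARDED }

    private final long checkpointId;
    private final AtomicReference<State> state = new AtomicReference<>(State.PENDING);

    PendingCheckpointSketch(long checkpointId) {
        this.checkpointId = checkpointId;
    }

    /** Async-IO thread: only a caller that wins this CAS may add to the checkpoint store. */
    boolean tryFinalize() {
        return state.compareAndSet(State.PENDING, State.FINALIZED);
    }

    /** Main thread: only a caller that wins this CAS may delete the checkpoint's files. */
    boolean tryDiscard() {
        return state.compareAndSet(State.PENDING, State.DISCARDED);
    }

    State state() {
        return state.get();
    }

    long checkpointId() {
        return checkpointId;
    }

    public static void main(String[] args) {
        PendingCheckpointSketch chk8 = new PendingCheckpointSketch(8);
        boolean discarded = chk8.tryDiscard();   // main thread arrives first
        boolean finalized = chk8.tryFinalize();  // async-IO thread loses the race
        System.out.println("chk-" + chk8.checkpointId()
                + " discarded=" + discarded + ", finalized=" + finalized);
    }
}
```

Both transitions start from {{PENDING}}, so exactly one of them can succeed. The cost is one CAS on the main thread, which matches the trade-off mentioned in the comment.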
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072553#comment-17072553 ]

Biao Liu commented on FLINK-16770:
----------------------------------

Hi [~yunta], thanks for the analysis. I have a question: if chk-8 is dropped when cancelling the job, chk-7 would not be subsumed, since the finalization of chk-8 would not finish after it is asynchronously added to the checkpoint store. It checks the discarding state before doing the subsuming.

Although I haven't checked the test case carefully, I guess this might be related to FLINK-14971, which made the threading model here asynchronous. There is a small possibility that a checkpoint is discarded but could still be added to the checkpoint store successfully, because the cancellation and the manipulation of the checkpoint store now happen in different threads. There is no longer one big lock for everything, as there was before. Do you think this could cause the failure?
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072461#comment-17072461 ]

Robert Metzger commented on FLINK-16770:

I did not know that you are working on this ticket as well. To avoid duplicate work in the future, it would be nice if you could assign yourself / or write a comment if you are working on something. I will assign both tickets to you.
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072458#comment-17072458 ]

Yun Tang commented on FLINK-16770:
----------------------------------

[~rmetzger] I have reproduced this with additional logs in my [private branch|https://github.com/Myasuka/flink/tree/travis-fix-bug] and my personal Azure pipeline: https://myasuka.visualstudio.com/flink/_build/results?buildId=10=logs=1f3ed471-1849-5d3c-a34c-19792af4ad16=2f5b54d0-1d28-5b01-d344-aa50ffe0cdf8 .

From the additional logs, I have figured out why this can happen: {{CheckpointCoordinator}} drops the pending checkpoint-8 when cancelling the job in {{CheckpointCoordinator#stopCheckpointScheduler}}; however, chk-8 has just been asynchronously added to the checkpoint store successfully during {{PendingCheckpoint#finalizeCheckpoint}}. On the other hand, since the job is not shut down, the main thread executor will then subsume chk-7 based on the successful result returned by {{PendingCheckpoint#finalizeCheckpoint}}. That is why we see this log:
{code:java}
Checkpoint with ID 8 at 'file:/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-07881636808/externalized-chckpt-e2e-backend-dir/0329a0facde65e8a8432124ce5db8e3c/chk-8' not discarded.
{code}
In the end, chk-8 is deleted when we stop the scheduler, and chk-7 is deleted when we learn that chk-8 was successfully added to the checkpoint store.

I am not sure whether you have already done some work to figure out the root cause; please assign this ticket to me if you don't mind and have not done much work on it yet.
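The interleaving that deletes both chk-7 and chk-8 can be replayed deterministically in a few lines. The sketch below is a single-threaded toy model, not Flink code: {{RaceReplaySketch}}, its methods and the retention limit of one checkpoint are assumptions made for illustration, and "on disk" is modelled as a plain set of checkpoint IDs.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the race; checkpoint files are modelled as IDs in a set. */
final class RaceReplaySketch {
    private final Set<Long> onDisk = new TreeSet<>();
    private final Deque<Long> store = new ArrayDeque<>();
    private final int maxRetained = 1; // assumed retention limit for this example

    void writeToDisk(long id) {
        onDisk.add(id);
    }

    /** Async-IO thread: registers the completed checkpoint in the store. */
    void addToStore(long id) {
        store.addLast(id);
    }

    /** Main thread: deletes older checkpoints beyond the retention limit. */
    void subsumeOlder() {
        while (store.size() > maxRetained) {
            onDisk.remove(store.removeFirst());
        }
    }

    /** Main thread: deletes the files of a pending checkpoint on cancellation. */
    void discardPending(long id) {
        onDisk.remove(id);
    }

    Set<Long> onDisk() {
        return new TreeSet<>(onDisk);
    }

    /** Replays the order of events described in the analysis. */
    static Set<Long> replay() {
        RaceReplaySketch s = new RaceReplaySketch();
        s.writeToDisk(7);
        s.addToStore(7);      // chk-7 completed normally
        s.writeToDisk(8);     // chk-8 pending, files written
        s.addToStore(8);      // async-IO thread: chk-8 added to the store
        s.discardPending(8);  // main thread: scheduler stopped, chk-8 dropped
        s.subsumeOlder();     // main thread: finalize reported success, chk-7 subsumed
        return s.onDisk();
    }

    public static void main(String[] args) {
        System.out.println("checkpoints left on disk: " + replay());
    }
}
```

After the replay no checkpoint directory is left, which is consistent with the test script's {{ls}} on {{chk-[1-9]*/_metadata}} finding nothing.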
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072451#comment-17072451 ]

Robert Metzger commented on FLINK-16770:

I have understood the following so far:
- the test is searching for the checkpoint directory, but no checkpoint exists
- It seems that the checkpoint N does not get retained, if N+1 gets triggered and the job gets cancelled immediately thereafter. The job has checkpoint retention on cancellation enabled.

Proof:
{code}
$ cat flink-vsts-standalonesession-0-fv-az53.log | grep "CheckpointCoo\|job.lastCheckpointExternalPath\|switched from state"
2020-04-01 06:30:18,805 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job General purpose test job (c34d2f91cf100e020226725452b5000a) switched from state CREATED to RUNNING.
localhost.jobmanager.General purpose test job.lastCheckpointExternalPath: n/a
2020-04-01 06:30:19,571 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Checkpoint triggering task Source: EventSource -> Timestamps/Watermarks (1/4) of job c34d2f91cf100e020226725452b5000a is not in state RUNNING but DEPLOYING instead. Aborting checkpoint.
localhost.jobmanager.General purpose test job.lastCheckpointExternalPath: n/a
2020-04-01 06:30:20,597 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering checkpoint 1 @ 1585722620570 for job c34d2f91cf100e020226725452b5000a.
2020-04-01 06:30:21,170 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Completed checkpoint 1 for job c34d2f91cf100e020226725452b5000a (158574 bytes in 597 ms).
localhost.jobmanager.General purpose test job.lastCheckpointExternalPath: file:/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-08993937085/externalized-chckpt-e2e-backend-dir/c34d2f91cf100e020226725452b5000a/chk-1
2020-04-01 06:30:21,570 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering checkpoint 2 @ 1585722621570 for job c34d2f91cf100e020226725452b5000a.
2020-04-01 06:30:21,689 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Completed checkpoint 2 for job c34d2f91cf100e020226725452b5000a (274341 bytes in 113 ms).
localhost.jobmanager.General purpose test job.lastCheckpointExternalPath: file:/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-08993937085/externalized-chckpt-e2e-backend-dir/c34d2f91cf100e020226725452b5000a/chk-2
2020-04-01 06:30:22,571 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering checkpoint 3 @ 1585722622570 for job c34d2f91cf100e020226725452b5000a.
2020-04-01 06:30:22,689 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Completed checkpoint 3 for job c34d2f91cf100e020226725452b5000a (326291 bytes in 118 ms).
localhost.jobmanager.General purpose test job.lastCheckpointExternalPath: file:/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-08993937085/externalized-chckpt-e2e-backend-dir/c34d2f91cf100e020226725452b5000a/chk-3
2020-04-01 06:30:23,571 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering checkpoint 4 @ 1585722623570 for job c34d2f91cf100e020226725452b5000a.
2020-04-01 06:30:23,650 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Completed checkpoint 4 for job c34d2f91cf100e020226725452b5000a (341697 bytes in 78 ms).
localhost.jobmanager.General purpose test job.lastCheckpointExternalPath: file:/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-08993937085/externalized-chckpt-e2e-backend-dir/c34d2f91cf100e020226725452b5000a/chk-4
2020-04-01 06:30:24,570 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering checkpoint 5 @ 1585722624570 for job c34d2f91cf100e020226725452b5000a.
2020-04-01 06:30:24,643 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Completed checkpoint 5 for job c34d2f91cf100e020226725452b5000a (345026 bytes in 72 ms).
localhost.jobmanager.General purpose test job.lastCheckpointExternalPath: file:/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-08993937085/externalized-chckpt-e2e-backend-dir/c34d2f91cf100e020226725452b5000a/chk-5
2020-04-01 06:30:25,571 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering checkpoint 6 @ 1585722625570 for job c34d2f91cf100e020226725452b5000a.
2020-04-01 06:30:25,659 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Completed checkpoint 6 for job c34d2f91cf100e020226725452b5000a (347049 bytes in 88 ms).
localhost.jobmanager.General purpose test job.lastCheckpointExternalPath: file:/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-08993937085/externalized-chckpt-e2e-backend-dir/c34d2f91cf100e020226725452b5000a/chk-6
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072031#comment-17072031 ]

Robert Metzger commented on FLINK-16770:

Another instance: https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6892=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5

Assigning myself to address this ...

> Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
> ---
>
> Key: FLINK-16770
> URL: https://issues.apache.org/jira/browse/FLINK-16770
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Tests
> Reporter: Zhijiang
> Priority: Blocker
> Labels: test-stability
> Fix For: 1.11.0
>
>
> The log :
> [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6603=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]
>
> There was also the similar problem in https://issues.apache.org/jira/browse/FLINK-16561, but for the case of no parallelism change. And this case is for scaling up. Not quite sure whether the root cause is the same one.
> {code:java}
> 2020-03-25T06:50:31.3894841Z Running 'Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test'
> 2020-03-25T06:50:31.3895308Z ==
> 2020-03-25T06:50:31.3907274Z TEST_DATA_DIR: /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-31390197304
> 2020-03-25T06:50:31.5500274Z Flink dist directory: /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-25T06:50:31.6354639Z Starting cluster.
> 2020-03-25T06:50:31.8871932Z Starting standalonesession daemon on host fv-az655.
> 2020-03-25T06:50:33.5021784Z Starting taskexecutor daemon on host fv-az655.
> 2020-03-25T06:50:33.5152274Z Waiting for Dispatcher REST endpoint to come up...
> 2020-03-25T06:50:34.5498116Z Waiting for Dispatcher REST endpoint to come up...
> 2020-03-25T06:50:35.6031346Z Waiting for Dispatcher REST endpoint to come up...
> 2020-03-25T06:50:36.9848425Z Waiting for Dispatcher REST endpoint to come up...
> 2020-03-25T06:50:38.0283377Z Dispatcher REST endpoint is up.
> 2020-03-25T06:50:38.0285490Z Running externalized checkpoints test, with ORIGINAL_DOP=2 NEW_DOP=4 and STATE_BACKEND_TYPE=rocks STATE_BACKEND_FILE_ASYNC=true STATE_BACKEND_ROCKSDB_INCREMENTAL=true SIMULATE_FAILURE=false ...
> 2020-03-25T06:50:46.1754645Z Job (b8cb04e4b1e730585bc616aa352866d0) is running.
> 2020-03-25T06:50:46.1758132Z Waiting for job (b8cb04e4b1e730585bc616aa352866d0) to have at least 1 completed checkpoints ...
> 2020-03-25T06:50:46.3478276Z Waiting for job to process up to 200 records, current progress: 173 records ...
> 2020-03-25T06:50:49.6332988Z Cancelling job b8cb04e4b1e730585bc616aa352866d0.
> 2020-03-25T06:50:50.4875673Z Cancelled job b8cb04e4b1e730585bc616aa352866d0.
> 2020-03-25T06:50:50.5468230Z ls: cannot access '/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-31390197304/externalized-chckpt-e2e-backend-dir/b8cb04e4b1e730585bc616aa352866d0/chk-[1-9]*/_metadata': No such file or directory
> 2020-03-25T06:50:50.5606260Z Restoring job with externalized checkpoint at . ...
> 2020-03-25T06:50:58.4728245Z
> 2020-03-25T06:50:58.4732663Z
> 2020-03-25T06:50:58.4735785Z The program finished with the following exception:
> 2020-03-25T06:50:58.4737759Z
> 2020-03-25T06:50:58.4742666Z org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: java.util.concurrent.ExecutionException: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
> 2020-03-25T06:50:58.4746274Z at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
> 2020-03-25T06:50:58.4749954Z at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205)
> 2020-03-25T06:50:58.4752753Z at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:142)
> 2020-03-25T06:50:58.4755400Z at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:659)
> 2020-03-25T06:50:58.4757862Z at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:210)
> 2020-03-25T06:50:58.4760282Z at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:890)
> 2020-03-25T06:50:58.4763591Z at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:963)
> 2020-03-25T06:50:58.4764274Z
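
In the quoted log, the `ls` on `chk-[1-9]*/_metadata` finds nothing and the next line reads "Restoring job with externalized checkpoint at . ...", which suggests the restore step ran with an empty checkpoint path. The following is a minimal Python sketch of a lookup that stops early in that situation. It is not the actual test script: the function names and the fail-fast behaviour are assumptions made for illustration, and only the `chk-N/_metadata` directory layout is taken from the log.

```python
import re
from pathlib import Path
from typing import Optional

# Directory names such as chk-1, chk-27 (same shape as the chk-[1-9]* glob in the log).
_CHK_DIR = re.compile(r"chk-([1-9][0-9]*)")


def find_latest_checkpoint_metadata(job_checkpoint_dir: str) -> Optional[Path]:
    """Return the _metadata file of the highest-numbered chk-N directory, or None."""
    root = Path(job_checkpoint_dir)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = _CHK_DIR.fullmatch(child.name)
        metadata = child / "_metadata"
        # Only directories that actually contain a _metadata file count as restorable.
        if match and metadata.is_file():
            candidates.append((int(match.group(1)), metadata))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


def require_checkpoint_path(job_checkpoint_dir: str) -> Path:
    """Hypothetical helper: fail fast instead of restoring from an empty path."""
    metadata = find_latest_checkpoint_metadata(job_checkpoint_dir)
    if metadata is None:
        raise FileNotFoundError(
            f"no chk-N/_metadata found under {job_checkpoint_dir}"
        )
    return metadata.parent
```

Stopping at the lookup would report the missing `_metadata` file directly, rather than letting it show up later as a job submission failure.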
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071768#comment-17071768 ]

Piotr Nowojski commented on FLINK-16770:

Another instance: https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6882=logs=c88eea3b-64a0-564d-0031-9fdcd7b8abee=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5

> Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
> ---
>
> Key: FLINK-16770
> URL: https://issues.apache.org/jira/browse/FLINK-16770
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Tests
> Reporter: Zhijiang
> Priority: Critical
> Labels: test-stability
> Fix For: 1.11.0
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070233#comment-17070233 ]

Zhijiang commented on FLINK-16770:

Another instance: [https://travis-ci.org/apache/flink/builds/668073755?utm_source=slack_medium=notification]

> Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
> ---
>
> Key: FLINK-16770
> URL: https://issues.apache.org/jira/browse/FLINK-16770
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Tests
> Reporter: Zhijiang
> Priority: Critical
> Labels: test-stability
> Fix For: 1.11.0
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070230#comment-17070230 ]

Zhijiang commented on FLINK-16770:

Another instance: [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6788=logs=7bafe89a-737e-5a81-708c-24b72a2345fc=8f0197c1-92aa-5b5f-4284-1ae542d75a1e]

> Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
> ---
>
> Key: FLINK-16770
> URL: https://issues.apache.org/jira/browse/FLINK-16770
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Tests
> Reporter: Zhijiang
> Priority: Major
> Labels: test-stability
> Fix For: 1.11.0
[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066584#comment-17066584 ]

Yun Tang commented on FLINK-16770:

I think this issue has the same root cause as FLINK-16561, and I cannot figure out why this could happen if we retain checkpoints and at least one checkpoint has completed. Is there any place to find the detailed cluster logs? It seems that uploading the logs failed.

> Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file
> ---
>
> Key: FLINK-16770
> URL: https://issues.apache.org/jira/browse/FLINK-16770
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Tests
> Reporter: Zhijiang
> Priority: Major
> Labels: test-stability
> Fix For: 1.11.0