Essentially this exception just means that the savepoint operation took longer than the CLI expected.

This can happen for a number of reasons. Everything may be working as expected and the timeout is simply too low (it is controlled via "client.timeout"). It could also be that the savepoint operation is taking abnormally long, for example due to I/O bottlenecks.

I suggest looking into the JobManager logs to see whether the savepoint was actually created and the application shut down; if so, simply increasing the timeout may be enough.
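
To illustrate why the TimeoutException by itself does not tell you whether the savepoint failed, here is a minimal Python sketch. It is only an analogy and not Flink code: the names slow_savepoint and stop_with_timeout are made up for illustration, and the durations are arbitrary. The point is that a client which stops waiting on a future does not thereby cancel the work behind it.

```python
import concurrent.futures
import time


def slow_savepoint(duration_s):
    """Hypothetical stand-in for a savepoint that takes duration_s seconds."""
    time.sleep(duration_s)
    return "savepoint completed"


def stop_with_timeout(duration_s, client_timeout_s):
    """Wait for the operation the way a client would, giving up after client_timeout_s."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_savepoint, duration_s)
        try:
            # Bounded wait: raises TimeoutError if the result is not ready in time.
            return ("ok", future.result(timeout=client_timeout_s))
        except concurrent.futures.TimeoutError:
            # The client gave up waiting, but the operation was not cancelled.
            # Waiting without a bound shows that it still finishes.
            return ("client timed out", future.result())


if __name__ == "__main__":
    print(stop_with_timeout(0.05, 1.0))   # operation is faster than the timeout
    print(stop_with_timeout(0.3, 0.05))   # operation outlives the timeout
```

In the second call the client-side wait times out even though the operation itself succeeds, which is why checking the JobManager logs is the reliable way to find out what actually happened.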

On 5/11/2021 9:06 AM, Diwakar Jha wrote:
Hello,

I'm trying to use the Flink 1.11 stop command to gracefully shut down an application with a savepoint.

    flink stop --savepointPath s3a://path_to_save_point c5d52e0146258f80fd52a3bf002d2a1b -yid application_1620673166934_0001


    2021-05-11 06:26:57,852 ERROR org.apache.flink.client.cli.CliFrontend [] - Error while running the command.
    org.apache.flink.util.FlinkException: Could not stop with a savepoint job "c5d52e0146258f80fd52a3bf002d2a1b".
        at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
        at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
        at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
        at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
        at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
        at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_252]
        at javax.security.auth.Subject.doAs(Subject.java:422) [?:1.8.0_252]
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) [hadoop-common-3.2.1-amzn-1.jar:?]
        at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) [flink-dist_2.12-1.11.0.jar:1.11.0]
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992) [flink-dist_2.12-1.11.0.jar:1.11.0]
    Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784) ~[?:1.8.0_252]
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) ~[?:1.8.0_252]
        at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
        ... 9 more


The cancel command seems to be working fine.
Please let me know how to fix this TimeoutException.

Thanks.

