I see, thanks. Looks like it's better for us to switch to triggering savepoint & cancel separately.
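
For reference, a minimal sketch of the "savepoint first, then cancel" flow discussed below: trigger the savepoint exactly once, poll until it completes, and only then cancel. This is not Flink's own API. The three callables (trigger_savepoint, savepoint_status, cancel_job) are hypothetical wrappers you would write yourself around whatever client you use (REST, CLI, or the Java client), and the timeout and poll interval are arbitrary.

```python
import time
from typing import Callable, Optional


def savepoint_then_cancel(
    trigger_savepoint: Callable[[], str],
    savepoint_status: Callable[[str], Optional[str]],
    cancel_job: Callable[[], None],
    timeout_s: float = 600.0,
    poll_interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Trigger one savepoint, wait for it to finish, then cancel the job.

    Assumptions (all hypothetical, not part of Flink):
      - trigger_savepoint() starts a savepoint and returns a request id.
      - savepoint_status(request_id) returns the savepoint path once the
        savepoint is complete, or None while it is still in progress.
      - cancel_job() issues a plain cancel for the job.
    """
    # Issue the savepoint exactly once; re-issuing would start extra savepoints.
    request_id = trigger_savepoint()
    deadline = clock() + timeout_s

    while True:
        path = savepoint_status(request_id)
        if path is not None:
            break
        if clock() >= deadline:
            # Give up without cancelling, so the job keeps running.
            raise TimeoutError(
                f"savepoint {request_id} not complete after {timeout_s} s; "
                "job was not cancelled"
            )
        sleep(poll_interval_s)

    # Only reached after the savepoint has completed.
    cancel_job()
    return path
```

The cancel is deliberately issued only after completion is confirmed, since a concurrent plain cancel would interfere with a savepoint that is still running.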

On Wed, Aug 22, 2018 at 1:26 PM Till Rohrmann <trohrm...@apache.org> wrote:

> Calling cancel-with-savepoint multiple times will trigger multiple
> savepoints. The first issued savepoint will complete first and then cancel
> the job. Thus, the later savepoints might or might not complete, depending
> on the timing. Since a savepoint can flush results to external systems, I
> would recommend not calling the API multiple times.
>
> Cheers,
> Till
>
> On Wed, Aug 22, 2018 at 10:40 AM Juho Autio <juho.au...@rovio.com> wrote:
>
>> What I meant to ask was, does it do any harm to keep calling
>> cancel-with-savepoint until the job exits? If the job is already cancelling
>> with savepoint, I would assume that another cancel-with-savepoint call is
>> just ignored.
>>
>> On Tue, Aug 21, 2018 at 1:18 PM Till Rohrmann <trohrm...@apache.org>
>> wrote:
>>
>>> Just a small addition. A concurrent cancel call will interfere with the
>>> cancel-with-savepoint command and directly cancel the job. So it is better
>>> to use the cancel-with-savepoint call in order to take a savepoint and then
>>> cancel the job automatically.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Thu, Aug 9, 2018 at 9:53 AM vino yang <yanghua1...@gmail.com> wrote:
>>>
>>>> Hi Juho,
>>>>
>>>> We use the REST client API triggerSavepoint(). It returns a
>>>> CompletableFuture, and we then call its get() method.
>>>>
>>>> In other words, we wait for it to complete synchronously, because
>>>> cancelWithSavepoint itself waits for the savepoint to complete and
>>>> then executes the cancel command.
>>>>
>>>> We do not use the CLI. Since you are going through the CLI, I think you
>>>> can check whether the savepoint is complete from the logs or the web
>>>> UI.
>>>>
>>>> Thanks, vino.
>>>>
>>>>
>>>> On Thu, Aug 9, 2018 at 3:07 PM Juho Autio <juho.au...@rovio.com> wrote:
>>>>
>>>>> Thanks for the suggestion. Is the separate savepoint triggering async?
>>>>> Would you then separately poll for the savepoint's completion before
>>>>> executing cancel? If additional polling is needed, then I would say that
>>>>> for my purpose it's still easier to call cancel with savepoint and simply
>>>>> ignore the result of the call. I would assume that it won't do any harm if
>>>>> I keep retrying cancel with savepoint until the job stops – I expect that
>>>>> an overlapping cancel request is ignored if the job is already creating a
>>>>> savepoint. Please correct me if my assumption is wrong.
>>>>>
>>>>> On Thu, Aug 9, 2018 at 5:04 AM vino yang <yanghua1...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Juho,
>>>>>>
>>>>>> This problem does exist. As a temporary workaround, I suggest you
>>>>>> separate these two steps:
>>>>>> 1) Trigger the savepoint separately;
>>>>>> 2) Execute the cancel command.
>>>>>>
>>>>>> Hi Till, Chesnay:
>>>>>>
>>>>>> Our internal environment and multiple users on the mailing list have
>>>>>> encountered similar problems.
>>>>>>
>>>>>> In our environment, it seems that the JM shows that the savepoint is
>>>>>> complete and the JM has stopped itself, but the client still connects to
>>>>>> the old JM and reports a timeout exception.
>>>>>>
>>>>>> Thanks, vino.
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 8, 2018 at 9:18 PM Juho Autio <juho.au...@rovio.com> wrote:
>>>>>>
>>>>>>> I was trying to cancel a job with savepoint, but the CLI command
>>>>>>> failed with "akka.pattern.AskTimeoutException: Ask timed out".
>>>>>>>
>>>>>>> The stack trace reveals that the ask timeout is 10 seconds:
>>>>>>>
>>>>>>> Caused by: akka.pattern.AskTimeoutException: Ask timed out on
>>>>>>> [Actor[akka://flink/user/jobmanager_0#106635280]] after [10000 ms].
>>>>>>> Sender[null] sent message of type
>>>>>>> "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
>>>>>>>
>>>>>>> Indeed, it's documented that the default value
>>>>>>> for akka.ask.timeout is "10 s" in
>>>>>>>
>>>>>>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#distributed-coordination-via-akka
>>>>>>>
>>>>>>> Behind the scenes the savepoint creation & job cancellation
>>>>>>> succeeded, which was to be expected, kind of. So my problem is just
>>>>>>> getting a proper response back from the CLI call instead of timing
>>>>>>> out so eagerly.
>>>>>>>
>>>>>>> To be exact, what I ran was:
>>>>>>>
>>>>>>> flink-1.5.2/bin/flink cancel b7c7d19d25e16a952d3afa32841024e5 -m
>>>>>>> yarn-cluster -yid application_1533676784032_0001 --withSavepoint
>>>>>>>
>>>>>>> Should I change akka.ask.timeout to have a longer timeout? If
>>>>>>> yes, can I override it just for the CLI call somehow? Might it have
>>>>>>> undesired side effects if set globally for the actual Flink jobs to use?
>>>>>>>
>>>>>>> What about akka.client.timeout? The default for it is also rather
>>>>>>> low: "60 s". Should it also be increased accordingly if I want to accept
>>>>>>> longer than 60 s for savepoint creation?
>>>>>>>
>>>>>>> Finally, that default timeout is so low that I would expect this to
>>>>>>> be a common problem. I would say that the Flink CLI should have a higher
>>>>>>> default timeout for cancel and savepoint creation ops.
>>>>>>>
>>>>>>> Thanks!