Re: When should the RETAIN_ON_CANCELLATION option be used?

徐涛 Mon, 24 Sep 2018 23:59:50 -0700

 Hi Vino,
        So I will use the default setting, DELETE_ON_CANCELLATION. When the 
program is cancelled, the checkpoint will be deleted; when the program fails, 
the checkpoint will not be deleted, so I will still have a checkpoint that can 
be used to resume.
        Please correct me if I am wrong.
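
The understanding above can be written down as a small truth table. The sketch below is plain Java, not Flink code: the class `CheckpointRetentionSketch`, the `FinalStatus` enum and the method `isRetained` are hypothetical names used only for illustration, and the rule that a failed job keeps its checkpoint under both options is the assumption being asked about in this thread, not something the quoted documentation states outright. In a real job the chosen option would be passed to `CheckpointConfig#enableExternalizedCheckpoints`.

```java
/** Plain-Java model of the retention rules discussed in this thread; not Flink code. */
class CheckpointRetentionSketch {

    /** Mirrors the two option names quoted from the Flink 1.6 documentation. */
    enum Cleanup { RETAIN_ON_CANCELLATION, DELETE_ON_CANCELLATION }

    /** The two ways a job can end that the thread distinguishes (hypothetical enum). */
    enum FinalStatus { CANCELED, FAILED }

    /** Returns true if the latest externalized checkpoint is expected to still exist. */
    static boolean isRetained(Cleanup cleanup, FinalStatus status) {
        if (status == FinalStatus.FAILED) {
            // Assumption from the thread: a failed job keeps its checkpoint under both options.
            return true;
        }
        // On cancellation, only RETAIN_ON_CANCELLATION keeps the checkpoint.
        return cleanup == Cleanup.RETAIN_ON_CANCELLATION;
    }

    public static void main(String[] args) {
        // Print the full table for both options and both final statuses.
        for (Cleanup c : Cleanup.values()) {
            for (FinalStatus s : FinalStatus.values()) {
                System.out.println(c + " + " + s + " -> retained=" + isRetained(c, s));
            }
        }
    }
}
```

Under this model the only combination that loses the checkpoint is DELETE_ON_CANCELLATION with a cancelled job, which is the case a savepoint taken before a manual cancel is meant to cover.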


        Thanks.

Best 
Henry

> On Sep 25, 2018, at 2:22 PM, vino yang <yanghua1...@gmail.com> wrote:
> 
> Hi Henry,
> 
> I added my comments, in blue, to your original email.
> 
> Thanks, vino.
> 
> 徐涛 <happydexu...@gmail.com> wrote on Tue, Sep 25, 2018 at 12:56 PM:
> Hi Vino,
>       What is the definition of, and the difference between, job cancellation 
> and job failure?
>       Can I say that if the program is shut down manually, it is a job 
> cancellation, and if the program is shut down due to some error, it is a 
> job failure?
> 
> 
> This is not entirely true: manually triggering a cancel may also lead to 
> failure. If a human triggers the cancel and every task instance is cancelled 
> correctly, the job's final status is canceled. If the job terminates because 
> of some anomaly, its final status is failed.
>  
>       This is important because it is the prerequisite for the following 
> question:
> 
>       The Flink 1.6 documentation says:
>       "ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION: Retain the 
> checkpoint when the job is cancelled. Note that you have to manually clean up 
> the checkpoint state after cancellation in this case.    
>         ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION: Delete the 
> checkpoint when the job is cancelled. The checkpoint state will only be 
> available if the job fails."
>       But it does not say whether the checkpoint is retained on failure.
>       If the checkpoint behavior on failure is the same as on cancellation, 
> then I have to use RETAIN_ON_CANCELLATION, because otherwise the checkpoint 
> will be deleted when the job fails.
>       If the checkpoint is not deleted on failure, then in that case it is 
> safe when the job fails.
> 
> There are two enumeration classes in the configuration, 
> `CheckpointRetentionPolicy` and `ExternalizedCheckpointCleanup`; you need to 
> decide which one you want to use. Your question concerns 
> `ExternalizedCheckpointCleanup`, which controls cleanup of the metadata for 
> externalized checkpoints. Are you sure you want to use it? By default Flink 
> manages checkpoint cleanup itself, i.e. checkpoints are not externalized.
>  
>       
> Best 
> Henry 
>       
> 
> 
>> On Sep 25, 2018, at 11:16 AM, vino yang <yanghua1...@gmail.com> wrote:
>> 
>> Hi Henry,
>> 
>> Answers to your questions:
>> 
>> What is the definition of, and the difference between, job cancellation and 
>> job failure?
>> 
>> > Both cancellation and failure cause the job to enter a terminal state. 
>> > Cancellation is triggered manually and the job terminates normally, 
>> > while failure is usually a passive termination caused by an exception.
>> 
>> If I use the DELETE_ON_CANCELLATION option, do I still have a checkpoint to 
>> resume the program from in that case?
>> 
>> > No. If you use externalized checkpoints with this option, you cannot 
>> > resume from them after the job has been cancelled.
>> 
>> I mean, suppose I can guarantee that a savepoint is always taken before a 
>> manual cancellation. If I use the DELETE_ON_CANCELLATION option for 
>> checkpoints, is there any chance that I will have no checkpoint to recover 
>> from?
>> 
>> > According to the latest source code, savepoints are not affected by 
>> > `CheckpointRetentionPolicy`; they need to be cleaned up manually.
>> 
>> Thanks, vino.
>> 
>> 徐涛 <happydexu...@gmail.com> wrote on Tue, Sep 25, 2018 at 11:06 AM:
>> Hi All,
>>      I mean, suppose I can guarantee that a savepoint is always taken before 
>> a manual cancellation. If I use the DELETE_ON_CANCELLATION option for 
>> checkpoints, is there any chance that I will have no checkpoint to recover 
>> from?
>>      Thanks a lot.
>> 
>> Best
>> Henry
>> 
>> 
>> 
>>> On Sep 25, 2018, at 10:41 AM, 徐涛 <happydexu...@gmail.com> wrote:
>>> 
>>> Hi All,
>>>     The Flink documentation says of DELETE_ON_CANCELLATION: “Delete the 
>>> checkpoint when the job is cancelled. The checkpoint state will only be 
>>> available if the job fails.”
>>>     What is the definition of, and the difference between, job cancellation 
>>> and job failure? Suppose I run the program on YARN and, after a few days, 
>>> the YARN application fails for some reason.
>>>     If I use the DELETE_ON_CANCELLATION option, do I still have a checkpoint 
>>> to resume the program from in that case?
>>> 
>>>     If checkpoints are only deleted when I cancel the program, I can always 
>>> take a savepoint before cancelling. Then it seems that 
>>> DELETE_ON_CANCELLATION is the only setting I ever need.
>>>     I cannot find a case where RETAIN_ON_CANCELLATION should be used.
>>>     
>>> 
>>> Best
>>> Henry
>>> 
>> 
>
