And here is a snapshot of my checkpoint metrics under normal conditions.
On Thu, Sep 6, 2018 at 9:21 AM trung kien <kient...@gmail.com> wrote:

> Hi Yun,
>
> Yes, the job's status changes to RUNNING pretty fast after a failure (~1 min).
>
> As soon as the status changes to running, the first checkpoint kicks off, and it took 30 mins. I need exactly-once as I maintain some aggregation metrics. Do you know what the difference is between the first checkpoint and the checkpoints after that? (It's fairly quick after that.)
>
> Here are the sizes of my checkpoints (I configured it to keep the 5 latest checkpoints):
> 449M chk-1626
> 775M chk-1627
> 486M chk-1628
> 7.8G chk-1629
> 7.5G chk-1630
>
> I don't know why the sizes are so different.
> Metrics on checkpoints look good; besides the spike in the first checkpoint, everything looks fine.
>
> @Vino: Yes, I can try switching to DEBUG to see if I get any information.
>
>
> On Thu, Sep 6, 2018 at 7:09 AM vino yang <yanghua1...@gmail.com> wrote:
>
>> Hi trung,
>>
>> Can you provide more information to help locate the problem? For example, the size of the state generated by a checkpoint and more log information; you can try switching the log level to DEBUG.
>>
>> Thanks, vino.
>>
>> On Thu, Sep 6, 2018 at 7:42 PM Yun Tang <myas...@live.com> wrote:
>>
>>> Hi Kien,
>>>
>>> From your description, your job had already started to execute checkpoints after the job failover, which means your job was in RUNNING status. From my point of view, the actual recovery time should be the time spent in the job statuses RESTARTING -> CREATED -> RUNNING [1].
>>> Your trouble sounds more like the long time needed for the first checkpoint to complete after the job failover. AFAIK, it's probably because your job is heavily back-pressured after the failover and the checkpoint mode is exactly-once, so some operators need to receive all input checkpoint barriers before they can trigger the checkpoint. You can watch your checkpoint alignment time metrics to verify the root cause, and if you do not need the exactly-once guarantee, you can change the checkpoint mode to at-least-once [2].
>>>
>>> Best,
>>> Yun Tang
>>>
>>> [1] https://ci.apache.org/projects/flink/flink-docs-master/internals/job_scheduling.html#jobmanager-data-structures
>>> [2] https://ci.apache.org/projects/flink/flink-docs-stable/internals/stream_checkpointing.html#exactly-once-vs-at-least-once
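A minimal sketch of the at-least-once switch Yun Tang describes, assuming a plain DataStream job on a StreamExecutionEnvironment; the class name, checkpoint interval, and placeholder pipeline below are illustrative, not values from the thread:

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class AtLeastOnceCheckpointingSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint every 60 s. AT_LEAST_ONCE skips barrier alignment, so a
            // back-pressured operator does not stall waiting for all input barriers.
            env.enableCheckpointing(60_000, CheckpointingMode.AT_LEAST_ONCE);

            // Placeholder pipeline; replace with the real sources, operators, and sinks.
            env.fromElements(1, 2, 3).print();

            env.execute("at-least-once-checkpointing-sketch");
        }
    }

The trade-off is that records can be replayed more than once after a failure, which would over-count the aggregation metrics Kien mentions needing exactly-once for.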
>>> ------------------------------
>>> From: trung kien <kient...@gmail.com>
>>> Sent: Thursday, September 6, 2018 18:50
>>> To: user@flink.apache.org
>>> Subject: Flink failure recovery takes a very long time
>>>
>>> Hi all,
>>>
>>> I am trying to test failure recovery of a Flink job when a JM or TM goes down.
>>> Our target is to have the job restart automatically and return to normal condition in any case.
>>>
>>> However, what I am seeing is very strange, and I hope someone here can help me understand it.
>>>
>>> When the JM or TM went down, I saw the job being restarted, but as soon as it restarted it started working on a checkpoint that usually took 30+ minutes to finish (in the normal case a checkpoint only takes 1-2 minutes). As soon as the checkpoint finished, the job was back to normal.
>>>
>>> I'm using 1.4.2, but I'm seeing a similar thing on 1.6.0 as well.
>>>
>>> Could anyone please help explain this behavior? We really want to reduce the recovery time, but I can't seem to find any document that describes the recovery process in detail.
>>>
>>> Any help is really appreciated.
>>>
>>> --
>>> Thanks
>>> Kien
>>
>
> --
> Thanks
> Kien

--
Thanks
Kien
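On the "restart automatically in any case" goal in the original question: how aggressively the job is restarted after a JM/TM failure is configurable per job through a restart strategy. A rough sketch, with an illustrative attempt count and delay rather than tuned values:

    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RestartStrategySketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Restart the job up to 10 times, waiting 10 seconds between attempts.
            env.setRestartStrategy(
                    RestartStrategies.fixedDelayRestart(10, Time.of(10, TimeUnit.SECONDS)));

            // Placeholder pipeline; replace with the real job graph.
            env.fromElements("a", "b", "c").print();

            env.execute("restart-strategy-sketch");
        }
    }

Note that this only controls how quickly Flink re-deploys the job after a failure; it does not shorten the slow first checkpoint discussed earlier in the thread.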