And here is a snapshot of my checkpoint metrics under normal conditions.
On Thu, Sep 6, 2018 at 9:21 AM trung kien <kient...@gmail.com> wrote:

> Hi Yun,
>
> Yes, the job's status changes to RUNNING pretty fast after a failure (~1 min).
>
> As soon as the status changes to running, the first checkpoint kicks off, and it took 30 mins. I need exactly-once as I maintain some aggregation metrics. Do you know what the difference is between the first checkpoint and the checkpoints after that? (It's fairly quick after that.)
>
> Here are the sizes of my checkpoints (I configured it to keep the 5 latest checkpoints):
> 449M chk-1626
> 775M chk-1627
> 486M chk-1628
> 7.8G chk-1629
> 7.5G chk-1630
>
> I don't know why the sizes are so different.
> Metrics on checkpoints look good; besides the spike in the first checkpoint, everything looks fine.
>
> @Vino: Yes, I can try switching to DEBUG to see if I get any information.
>
>
> On Thu, Sep 6, 2018 at 7:09 AM vino yang <yanghua1...@gmail.com> wrote:
>
>> Hi trung,
>>
>> Can you provide more information to help locate the problem? For example, the size of the state generated by a checkpoint and more log information; you can try switching the log level to DEBUG.
>>
>> Thanks, vino.
>>
>> On Thu, Sep 6, 2018 at 7:42 PM Yun Tang <myas...@live.com> wrote:
>>
>>> Hi Kien,
>>>
>>> From your description, your job had already started to execute checkpoints after the job failover, which means your job was in RUNNING status. From my point of view, the actual recovery time should be the time spent in the job statuses RESTARTING -> CREATED -> RUNNING [1].
>>> Your trouble sounds more like the long time needed for the first checkpoint to complete after the job failover. AFAIK, it's probably because your job is heavily back-pressured after the failover and the checkpoint mode is exactly-once, so some operators need to receive all input checkpoint barriers before they can trigger the checkpoint. You can watch your checkpoint alignment time metrics to verify the root cause, and if you do not need the exactly-once guarantee, you can change the checkpoint mode to at-least-once [2].
>>>
>>> Best,
>>> Yun Tang
>>>
>>> [1] https://ci.apache.org/projects/flink/flink-docs-master/internals/job_scheduling.html#jobmanager-data-structures
>>> [2] https://ci.apache.org/projects/flink/flink-docs-stable/internals/stream_checkpointing.html#exactly-once-vs-at-least-once
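A minimal sketch of the at-least-once switch Yun Tang describes, assuming a plain DataStream job on a StreamExecutionEnvironment; the class name, checkpoint interval, and placeholder pipeline below are illustrative, not values from the thread:

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class AtLeastOnceCheckpointingSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint every 60 s. AT_LEAST_ONCE skips barrier alignment, so a
            // back-pressured operator does not stall waiting for all input barriers.
            env.enableCheckpointing(60_000, CheckpointingMode.AT_LEAST_ONCE);

            // Placeholder pipeline; replace with the real sources, operators, and sinks.
            env.fromElements(1, 2, 3).print();

            env.execute("at-least-once-checkpointing-sketch");
        }
    }

The trade-off is that records can be replayed more than once after a failure, which would over-count the aggregation metrics Kien mentions needing exactly-once for.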
>>> ------------------------------
>>> From: trung kien <kient...@gmail.com>
>>> Sent: Thursday, September 6, 2018 18:50
>>> To: user@flink.apache.org
>>> Subject: Flink failure recovery takes a very long time
>>>
>>> Hi all,
>>>
>>> I am trying to test failure recovery of a Flink job when a JM or TM goes down.
>>> Our target is to have the job restart automatically and return to normal condition in any case.
>>>
>>> However, what I am seeing is very strange, and I hope someone here can help me understand it.
>>>
>>> When the JM or TM went down, I saw the job being restarted, but as soon as it restarted it started working on a checkpoint that usually took 30+ minutes to finish (in the normal case a checkpoint only takes 1-2 minutes). As soon as the checkpoint finished, the job was back to normal.
>>>
>>> I'm using 1.4.2, but I'm seeing a similar thing on 1.6.0 as well.
>>>
>>> Could anyone please help explain this behavior? We really want to reduce the recovery time, but I can't seem to find any document that describes the recovery process in detail.
>>>
>>> Any help is really appreciated.
>>>
>>> --
>>> Thanks
>>> Kien
>>
>
> --
> Thanks
> Kien

--
Thanks
Kien
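On the "restart automatically in any case" goal in the original question: how aggressively the job is restarted after a JM/TM failure is configurable per job through a restart strategy. A rough sketch, with an illustrative attempt count and delay rather than tuned values:

    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RestartStrategySketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Restart the job up to 10 times, waiting 10 seconds between attempts.
            env.setRestartStrategy(
                    RestartStrategies.fixedDelayRestart(10, Time.of(10, TimeUnit.SECONDS)));

            // Placeholder pipeline; replace with the real job graph.
            env.fromElements("a", "b", "c").print();

            env.execute("restart-strategy-sketch");
        }
    }

Note that this only controls how quickly Flink re-deploys the job after a failure; it does not shorten the slow first checkpoint discussed earlier in the thread.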