Re: Checkpointing under backpressure

2020-07-31 Thread Piotr Nowojski
Thanks for the update and write-up, Arvid. Piotrek On Thu, 30 Jul 2020 at 11:05, Arvid Heise wrote: > Dear all, > > I just wanted to follow up on this long discussion thread by announcing > that we implemented unaligned checkpoints in Flink 1.11. If you experience > long end-to-end

Re: Checkpointing under backpressure

2020-07-30 Thread Arvid Heise
Dear all, I just wanted to follow up on this long discussion thread by announcing that we implemented unaligned checkpoints in Flink 1.11. If you experience long end-to-end checkpointing durations, you should try out unaligned checkpoints [1] if the following applies: - Checkpointing is not

Re: Checkpointing under backpressure

2019-12-04 Thread Thomas Weise
Hi Arvid, Thanks for putting together the proposal [1]. I'm planning to take a closer look in the next few days. Has any of the work been translated to JIRAs yet, and what would be the approximate target release? Thanks, Thomas [1]

Re: Checkpointing under backpressure

2019-10-02 Thread Arvid Heise
Sorry, incorrect link; please follow [1]. [1] https://mail-archives.apache.org/mod_mbox/flink-dev/201909.mbox/%3CCAGZNd0FgVL0oDQJHpBwJ1Ha8QevsVG0FHixdet11tLhW2p-2hg%40mail.gmail.com%3E On Wed, Oct 2, 2019 at 3:44 PM Arvid Heise wrote: > FYI, we published FLIP-76 to address the issue and

Re: Checkpointing under backpressure

2019-10-02 Thread Arvid Heise
FYI, we published FLIP-76 to address the issue and discussion has been opened in [1]. Looking forward to your feedback, Arvid [1] https://mail-archives.apache.org/mod_mbox/flink-dev/201909.mbox/browser On Thu, Aug 15, 2019 at 9:43 AM Yun Gao wrote: > Hi, > Very thanks for the great

Re: Checkpointing under backpressure

2019-08-15 Thread Yun Gao
Hi, Many thanks for the great points! Regarding prioritizing inputs, from another point of view, I think it might not cause other bad effects, since we do not need to totally block the channels that have seen barriers after the operator has taken its snapshot. After the snapshotting, if the

Re: Checkpointing under backpressure

2019-08-15 Thread Stephan Ewen
@Thomas just to double check: - parallelism and configuration changes should be well possible on unaligned checkpoints - changes in state types and JobGraph structure would be tricky, and changing the on-the-wire types would not be possible. On Wed, Aug 14, 2019 at 7:48 PM Thomas Weise

Re: Checkpointing under backpressure

2019-08-14 Thread Thomas Weise
--> On Wed, Aug 14, 2019 at 10:23 AM zhijiang wrote: > Thanks for these great points and discussions! > > 1. Considering the way of triggering checkpoint RPC calls to all the tasks > from Chandy Lamport, it combines two different mechanisms together to make > sure that the trigger could be fast

Re: Checkpointing under backpressure

2019-08-14 Thread zhijiang
Thanks for these great points and discussions! 1. Considering the way of triggering checkpoint RPC calls to all the tasks from Chandy Lamport, it combines two different mechanisms together to make sure that the trigger could be fast in different scenarios. But in the Flink world it might not be

Re: Checkpointing under backpressure

2019-08-14 Thread Piotr Nowojski
> Thanks for the great ideas so far. +1 Regarding other things raised, I mostly agree with Stephan. I like the idea of simultaneously starting the checkpoint everywhere via RPC call (especially in cases where Tasks are busy doing some synchronous operations for example for tens of

Re: Checkpointing under backpressure

2019-08-14 Thread Paris Carbone
Sure, I see. In cases when no periodic aligned snapshots are employed this is the only option. Two things that were not highlighted enough so far on the proposed protocol (including in my mails): - The Recovery/Reconfiguration strategy should strictly prioritise processing logged events

Re: Checkpointing under backpressure

2019-08-14 Thread Stephan Ewen
Scaling with unaligned checkpoints might be a necessity. Let's assume the job failed due to a lost TaskManager, but no new TaskManager becomes available. In that case we need to scale down based on the latest complete checkpoint, because we cannot produce a new checkpoint. On Wed, Aug 14, 2019

Re: Checkpointing under backpressure

2019-08-14 Thread Paris Carbone
+1 I think we are on the same page, Stephan. Rescaling on an unaligned checkpoint sounds challenging and a bit unnecessary. No? Why not stick to aligned snapshots for live reconfiguration/rescaling? It’s a pretty rare operation and it would simplify things by a lot. Everything can be “staged”

Re: Checkpointing under backpressure

2019-08-14 Thread Paris Carbone
Thanks for the responses. It is starting to get a bit clearer for everyone now. @Zhuzhu overlapping unaligned snapshots should be aborted/avoided imho. @Piotr point II, it was written a little too quickly, sorry about that. Simply put, the two following approaches are equivalent for a valid

Re: Checkpointing under backpressure

2019-08-14 Thread Stephan Ewen
Hi all! Yes, the first proposal of "unaligned checkpoints" (probably two years back now) drew major inspiration from Chandy Lamport, as did actually the original checkpointing algorithm. "Logging data between first and last barrier" versus "barrier jumping over buffer and storing those

Re: Checkpointing under backpressure

2019-08-14 Thread Yun Gao
Hi, Many thanks for sharing the thoughts on the unaligned checkpoint! Another question regarding I 2.C (Performance) by Paris: do we always snapshot and broadcast the marks once the task receives the first mark from the JM? If so, then we will always need to snapshot all the

Re: Checkpointing under backpressure

2019-08-14 Thread Piotr Nowojski
Hi again, Zhu Zhu, let me think about this more. Maybe, as Paris is writing, we do not need to block any channels at all, at least assuming credit-based flow control. What should happen with the following checkpoint is another question. Also, should we support concurrent checkpoints and

Re: Checkpointing under backpressure

2019-08-14 Thread Paris Carbone
Now I see a little more clearly what you have in mind. Thanks for the explanation! There are a few intermixed concepts here, some to do with correctness, some with performance. Before delving deeper I will just enumerate a few things to make myself a little more helpful if I can. I.

Re: Checkpointing under backpressure

2019-08-14 Thread Zhu Zhu
Thanks Piotr and Zhijiang for sharing the thoughts on unaligned checkpointing and the barrier overtaking. I have a question about 2.d) in Piotr's last mail that states "the Task first has to process the buffered data after that it can unblock the reads from the channels". Does this mean that we

Re: Checkpointing under backpressure

2019-08-14 Thread Piotr Nowojski
Hi, Thomas: There are no Jira tickets yet (or maybe there is something very old somewhere). First we want to discuss it, then present a FLIP, and finally create tickets :) > if I understand correctly, then the proposal is to not block any > input channel at all, but only log data from the

Re: Checkpointing under backpressure

2019-08-14 Thread zhijiang
Hi Thomas, There are no Jira tickets or discussions at the moment. If there are any updates I will ping you. I agree that there is a benefit to first reading, with high priority, the other channels that have not yet seen a barrier; otherwise it seems to waste CPU resources to migrate blocked buffers to another cached

Re: Checkpointing under backpressure

2019-08-13 Thread zhijiang
Hi Paris, Thanks for the detailed sharing. I think it is very similar to the way of overtaking we proposed before. There are some tiny differences: The way of overtaking might need to snapshot all the input/output queues. Chandy Lamport seems to only need to snapshot (n-1) input channels

Re: Checkpointing under backpressure

2019-08-13 Thread Paris Carbone
Yes! It’s quite similar, I think. Though mind that the devil is in the details, i.e., the temporal order in which actions are taken. To clarify, let us say you have a task T with two input channels I1 and I2. The Chandy Lamport execution flow is the following: 1) T receives barrier from I1 and... 2)
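
The step list above is cut off in this archive view. As a rough illustration only, here is a minimal Python sketch of the textbook Chandy-Lamport rule for a task with inputs I1 and I2. It is not a reconstruction of the elided steps and it is not Flink code. The `Task` class, the `BARRIER` marker, and the integer-summing state are all hypothetical names chosen for the example.

```python
BARRIER = object()  # hypothetical marker separating pre- and post-snapshot records


class Task:
    """Hypothetical task T that sums integers arriving on its input channels."""

    def __init__(self, channels):
        self.channels = list(channels)
        self.state = 0
        self.snapshot = None   # filled in when the first barrier arrives
        self.pending = set()   # channels whose barrier has not arrived yet

    def receive(self, channel, record):
        if record is BARRIER:
            if self.snapshot is None:
                # First barrier: snapshot the state right away, without
                # blocking or aligning the other channels.
                others = [c for c in self.channels if c != channel]
                self.snapshot = {
                    "state": self.state,
                    "in_flight": {c: [] for c in others},
                }
                self.pending = set(others)
            else:
                # A later barrier closes the log of that channel.
                self.pending.discard(channel)
            return
        if channel in self.pending:
            # Record arrived before this channel's barrier but after the
            # state snapshot: log it as part of the checkpoint.
            self.snapshot["in_flight"][channel].append(record)
        # Processing continues on every channel in either case.
        self.state += record

    def snapshot_complete(self):
        return self.snapshot is not None and not self.pending
```

On recovery under this rule, the task would restore `snapshot["state"]` and then replay the logged `in_flight` records before reading new input.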

Re: Checkpointing under backpressure

2019-08-13 Thread Piotr Nowojski
Thanks for the input. Regarding the Chandy-Lamport snapshots, don’t you still have to wait for the “checkpoint barrier” to arrive in order to know when you have received all possible messages from the upstream tasks/operators? So instead of processing the “in flight” messages (as the

Re: Checkpointing under backpressure

2019-08-13 Thread Paris Carbone
Interesting problem! Thanks for bringing it up, Thomas. Ignore/correct me if I am wrong, but I believe Chandy-Lamport snapshots [1] would help solve this problem more elegantly without sacrificing correctness. - They do not need alignment, only (async) logging for in-flight records between

Re: Checkpointing under backpressure

2019-08-13 Thread Piotr Nowojski
Hi Thomas, As Zhijiang has responded, we are now in the process of discussing how to address this issue, and one of the solutions that we are discussing is exactly what you are proposing: checkpoint barriers overtaking the in-flight data and making the in-flight data part of the checkpoint. If

Re: Checkpointing under backpressure

2019-08-12 Thread zhijiang
Hi Thomas, Thanks for raising this concern. Barrier alignment takes a long time under backpressure, which could cause several problems: 1. Checkpoint timeout, as you mentioned. 2. The recovery cost is high after a failover, because much data needs to be replayed. 3. The delay for commit-based