Re: flink snapshotting fault-tolerance

Paris Carbone Thu, 19 May 2016 11:36:18 -0700

Invalidations are not necessarily exposed (I hope). Think of it as implementing
TCP: you don’t have to warn the user that packets are lost, since eventually a
packet will be received at the other side in an eventually synchronous system.
Snapshots follow the same paradigm. Hope that helps.

On 19 May 2016, at 20:33, Stavros Kontopoulos
<st.kontopou...@gmail.com<mailto:st.kontopou...@gmail.com>> wrote:

Yes, that's what I was thinking, thanks. When people hear "exactly once" they
think "are you sure? there is something hidden there...", because theory is
theory :) So if you keep getting invalidated snapshots but data passes through
operators, do you issue a warning, or fail the pipeline and return an exception
to the driver?

On Thu, May 19, 2016 at 9:30 PM, Paris Carbone
<par...@kth.se<mailto:par...@kth.se>> wrote:
In that case, typically a timeout invalidates the whole snapshot (all states
for the same epoch) until eventually we have a full complete snapshot.

On 19 May 2016, at 20:26, Stavros Kontopoulos
<st.kontopou...@gmail.com<mailto:st.kontopou...@gmail.com>> wrote:

"Checkpoints are only confirmed if all parallel subtasks successfully created a
valid snapshot of the state." as stated by Robert. So to rephrase my
question: how is it confirmed that all snapshots are finished, and what
happens if some task is very slow or blocked?
If you have N tasks confirmed and one missing, what do you do? Do you start a
new checkpoint for that one, or a new global checkpoint for the other N tasks
as well?

On Thu, May 19, 2016 at 9:21 PM, Paris Carbone
<par...@kth.se<mailto:par...@kth.se>> wrote:

Regarding your last question,
If a checkpoint expires it just gets invalidated and a new complete checkpoint
will eventually occur that can be used for recovery. If I am wrong, or
something has changed please correct me.

Paris

On 19 May 2016, at 20:14, Paris Carbone <par...@kth.se<mailto:par...@kth.se>>
wrote:

Hi Stavros,

Currently, rollback failure recovery in Flink works at the pipeline level, not
at the task level (see Millwheel [1]). It further builds on replayable stream
logs (e.g. Kafka); thus, there is no need for 3PC or backup in the pipeline
sources. You can also check this presentation [2], which I hope explains the
basic concepts in more detail. Mind that many upcoming optimisation
opportunities are going to be addressed in the not so long-term Flink roadmap.
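
A toy sketch of pipeline-level rollback over a replayable log, assuming a single
summing operator and an in-memory list standing in for a log such as Kafka. The
names `run`, `fail_at` and `checkpoint_every` are hypothetical and are not
Flink APIs. The point it illustrates: because the source offset is stored
together with the operator state, a crash only requires restoring the last
checkpoint and re-reading the log from that offset.

```python
def run(log, fail_at=None, checkpoint_every=3):
    """Sum a replayable log, optionally crashing once before index fail_at."""
    # A checkpoint pairs the next source offset with the operator state.
    checkpoint = (0, 0)
    offset, total = checkpoint
    failed = False
    while offset < len(log):
        if fail_at is not None and offset == fail_at and not failed:
            failed = True
            # Roll the whole (one-operator) pipeline back and replay the log.
            offset, total = checkpoint
            continue
        total += log[offset]
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, total)
    return total


log = [5, 1, 4, 2, 8, 7, 3]
clean = run(log)
recovered = run(log, fail_at=5)
```

Records 3 and 4 are processed twice in the failing run, but the restored state
predates them, so each record is reflected in the result exactly once.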

Paris

[1]
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41378.pdf
[2]
http://www.slideshare.net/ParisCarbone/tech-talk-google-on-flink-fault-tolerance-and-ha

On 19 May 2016, at 19:43, Stavros Kontopoulos
<st.kontopou...@gmail.com<mailto:st.kontopou...@gmail.com>> wrote:

Cool, thanks. So if a checkpoint expires, will the whole pipeline block or fail,
or only the specific task related to the operator (running along with the
checkpoint task), or does nothing happen?

On Tue, May 17, 2016 at 3:49 PM, Robert Metzger
<rmetz...@apache.org<mailto:rmetz...@apache.org>> wrote:
Hi Stavros,

I haven't implemented our checkpointing mechanism and I didn't participate in
the design decisions while implementing it, so I can not compare it in detail
to other approaches.

From a "does it work" perspective: Checkpoints are only confirmed if all
parallel subtasks successfully created a valid snapshot of the state. So if
there is a failure in the checkpointing mechanism, no valid checkpoint will be
created. The system will recover from the last valid checkpoint.
There is a timeout for checkpoints. So if a barrier doesn't pass through the
system for a certain period of time, the checkpoint is cancelled. The default
timeout is 10 minutes.
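
A minimal sketch of the rule described above, assuming a simplified coordinator
that tracks acknowledgements per checkpoint. `ToyCoordinator` and its methods
are hypothetical names for illustration and do not mirror Flink's actual
classes. Time is passed in explicitly so the behaviour is deterministic.

```python
from dataclasses import dataclass, field

DEFAULT_TIMEOUT_S = 600.0  # the 10-minute default mentioned above


@dataclass
class PendingCheckpoint:
    checkpoint_id: int
    started_at: float
    expected: frozenset
    acked: set = field(default_factory=set)


class ToyCoordinator:
    """Hypothetical coordinator: confirm only when every subtask has acked."""

    def __init__(self, subtasks, timeout_s=DEFAULT_TIMEOUT_S):
        self.subtasks = frozenset(subtasks)
        self.timeout_s = timeout_s
        self.pending = {}
        self.last_completed = None  # id of the last valid checkpoint
        self.next_id = 1

    def trigger(self, now):
        """Start a new global checkpoint covering all subtasks."""
        cp = PendingCheckpoint(self.next_id, now, self.subtasks)
        self.pending[cp.checkpoint_id] = cp
        self.next_id += 1
        return cp.checkpoint_id

    def expire(self, now):
        """Cancel every pending checkpoint that is older than the timeout."""
        expired = [cid for cid, cp in self.pending.items()
                   if now - cp.started_at > self.timeout_s]
        for cid in expired:
            del self.pending[cid]  # the partial acks are discarded as a whole
        return expired

    def acknowledge(self, checkpoint_id, subtask, now):
        """Record one ack; return True if it completes the checkpoint."""
        self.expire(now)
        cp = self.pending.get(checkpoint_id)
        if cp is None or subtask not in cp.expected:
            return False  # late or unknown acks are ignored
        cp.acked.add(subtask)
        if cp.acked == cp.expected:
            del self.pending[checkpoint_id]
            if self.last_completed is None or checkpoint_id > self.last_completed:
                self.last_completed = checkpoint_id
            return True
        return False


tasks = ["map-0", "map-1", "sink-0"]
coord = ToyCoordinator(tasks)
first = coord.trigger(now=0.0)
coord.acknowledge(first, "map-0", now=1.0)
coord.acknowledge(first, "map-1", now=2.0)
# sink-0 is blocked, so its ack arrives after the timeout and is ignored.
late = coord.acknowledge(first, "sink-0", now=700.0)
second = coord.trigger(now=700.0)
results = [coord.acknowledge(second, t, now=701.0) for t in tasks]
```

In this sketch the slow subtask does not get a checkpoint of its own: the
expired checkpoint is dropped entirely and a later global one takes its place.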

Regards,
Robert

On Mon, May 16, 2016 at 1:22 PM, Stavros Kontopoulos
<st.kontopou...@gmail.com<mailto:st.kontopou...@gmail.com>> wrote:
Hi,

I was looking into the flink snapshotting algorithm details also mentioned here:
http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
https://blog.acolyer.org/2015/08/19/asynchronous-distributed-snapshots-for-distributed-dataflows/
http://mail-archives.apache.org/mod_mbox/flink-user/201601.mbox/%3CCANC1h_s6MCWSuDf2zSnEeD66LszDoLx0jt64++0kBOKTjkAv7w%40mail.gmail.com%3E
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/About-exactly-once-question-td2545.html

From other sources I understand that, to work, it assumes no failures in
message delivery and, for example, no process hanging forever:
https://en.wikipedia.org/wiki/Snapshot_algorithm
https://blog.acolyer.org/2015/04/22/distributed-snapshots-determining-global-states-of-distributed-systems/

So my understanding (maybe wrong) is that this solution does not seem to
address fault tolerance in a strong manner, as it would if, for example, it
used a 3PC protocol for local state propagation and global agreement. I know
the latter is not efficient; I am just mentioning it for comparison.

How does the algorithm behave in practical terms in the presence of its own
failures (this is a background process collecting partial states)? Are there
timeouts for reaching a barrier?

PS. I have not looked deep into the code details yet, but I am planning to.

Best,
Stavros
