@Gabor: That assumes deterministic streams and, to some extent, a deterministic tuple order. That may hold sometimes, but it is a very strong assumption in many cases.
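To make the concern concrete, here is a minimal sketch (plain Python, all class and method names illustrative, not Flink API) contrasting the two ways the quoted "look into the buffer before pushing" idea could identify already-pushed elements: by position in the stream, which silently corrupts output when a replay arrives in a different order, versus by a unique record ID, which is order-independent but requires stable IDs.

```python
# Illustrative sketch (not Flink API): replay deduplication by position
# assumes a deterministic replay order; deduplication by record ID does not.

class PositionDedupSink:
    """Skips the first N replayed elements, where N = records already pushed.
    Correct only if a replay reproduces exactly the same order."""
    def __init__(self):
        self.pushed_count = 0   # would be persisted across failures in a real sink
        self.output = []        # stands in for the external target system

    def invoke(self, record, position):
        if position < self.pushed_count:
            return              # assume this slot was already pushed
        self.output.append(record)
        self.pushed_count += 1

class IdDedupSink:
    """Remembers the IDs of pushed records; replay order is irrelevant,
    but every record needs a unique, stable ID."""
    def __init__(self):
        self.pushed_ids = set()  # would be persisted in a real sink
        self.output = []

    def invoke(self, record, record_id):
        if record_id in self.pushed_ids:
            return
        self.output.append(record)
        self.pushed_ids.add(record_id)

def run(sink, stream, key):
    for i, rec in enumerate(stream):
        sink.invoke(rec, key(i, rec))

# First attempt pushes "a", "b"; then the job fails and the replay
# produces the stream in a different (non-deterministic) order.
pos_sink = PositionDedupSink()
run(pos_sink, ["a", "b"], lambda i, r: i)
run(pos_sink, ["b", "c", "a"], lambda i, r: i)   # reordered replay
print(pos_sink.output)   # ['a', 'b', 'a'] -- "c" was lost, "a" duplicated

id_sink = IdDedupSink()
run(id_sink, ["a", "b"], lambda i, r: r)         # record itself as its ID
run(id_sink, ["b", "c", "a"], lambda i, r: r)    # same reordered replay
print(id_sink.output)    # ['a', 'b', 'c'] -- each record exactly once
```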
On Fri, Feb 5, 2016 at 1:09 PM, Gábor Gévay <gga...@gmail.com> wrote:
> Hello,
>
> > I think that there is actually a fundamental latency issue with
> > "exactly once sinks", no matter how you implement them in any systems:
> > You can only commit once you are sure that everything went well,
> > to a specific point where you are sure no replay will ever be needed.
>
> What if the persistent buffer in the sink would be used to determine
> which data elements should be emitted in case of a replay? I mean, the
> sink pushes everything as soon as it arrives, and also writes
> everything to the persistent buffer, and then in case of a replay it
> looks into the buffer before pushing every element, and only does the
> push if the buffer says that the element was not pushed before.
>
> Best,
> Gábor
>
>
> 2016-02-05 11:57 GMT+01:00 Stephan Ewen <se...@apache.org>:
> > Hi Niels!
> >
> > In general, exactly once output requires transactional cooperation from
> > the target system. Kafka has that on the roadmap; we should be able to
> > integrate that once it is out.
> > That means output is "committed" upon completed checkpoints, which
> > guarantees nothing is written multiple times.
> >
> > Chesnay is working on an interesting prototype as a generic solution
> > (also for Kafka, while they don't have that feature):
> > It buffers the data in the sink persistently (using the fault tolerance
> > state backends) and pushes the results out on notification of a
> > completed checkpoint.
> > That gives you exactly once semantics, but involves an extra
> > materialization of the data.
> >
> >
> > I think that there is actually a fundamental latency issue with
> > "exactly once sinks", no matter how you implement them in any systems:
> > You can only commit once you are sure that everything went well, to a
> > specific point where you are sure no replay will ever be needed.
> >
> > So the latency in Flink for an exactly-once output would be at least
> > the checkpoint interval.
> >
> > I'm eager to hear your thoughts on this.
> >
> > Greetings,
> > Stephan
> >
> >
> > On Fri, Feb 5, 2016 at 11:17 AM, Niels Basjes <ni...@basjes.nl> wrote:
> >> Hi,
> >>
> >> It is my understanding that the exactly-once semantics regarding the
> >> input from Kafka is based on the checkpointing in the source component
> >> retaining the offset where it was at the checkpoint moment.
> >>
> >> My question is how does that work for a sink? How can I make sure that
> >> (in light of failures) each message that is read from Kafka (my input)
> >> is written to Kafka (my output) exactly once?
> >>
> >>
> >> --
> >> Best regards / Met vriendelijke groeten,
> >>
> >> Niels Basjes
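The generic prototype Stephan describes above (buffer persistently in the sink, emit only on notification of a completed checkpoint) can be sketched as follows. This is plain Python with illustrative names, not the actual Flink or Chesnay code; the `committed` list stands in for the external target system. It also shows why the latency floor is the checkpoint interval: nothing becomes visible downstream until the checkpoint-complete notification arrives.

```python
# Generic sketch (names illustrative, not Flink API) of an exactly-once
# sink that stages records per checkpoint and emits them only once the
# checkpoint is reported complete.

class BufferingSink:
    def __init__(self):
        self.current_buffer = []   # records since the last checkpoint barrier
        self.pending = {}          # checkpoint_id -> buffered records (persisted)
        self.committed = []        # stands in for the external target system

    def invoke(self, record):
        # Records are NOT pushed immediately; they are staged in the buffer.
        self.current_buffer.append(record)

    def snapshot_state(self, checkpoint_id):
        # On a checkpoint barrier: persist the buffer under this checkpoint
        # (in a real sink, via the fault-tolerance state backend).
        self.pending[checkpoint_id] = self.current_buffer
        self.current_buffer = []

    def notify_checkpoint_complete(self, checkpoint_id):
        # Only now is pushing safe: no replay can go behind this checkpoint,
        # so every buffered record up to it is emitted exactly once.
        for cid in sorted(c for c in self.pending if c <= checkpoint_id):
            self.committed.extend(self.pending.pop(cid))

sink = BufferingSink()
sink.invoke("a")
sink.invoke("b")
sink.snapshot_state(checkpoint_id=1)
sink.invoke("c")                      # belongs to the next checkpoint
print(sink.committed)                 # [] -- nothing visible before notification
sink.notify_checkpoint_complete(1)
print(sink.committed)                 # ['a', 'b'] -- "c" waits for checkpoint 2
```

The visibility delay between `invoke` and `notify_checkpoint_complete` is exactly the latency cost discussed in the thread: results lag the stream by at least one checkpoint interval.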