Hi Fabian,

Thanks, that's very helpful. Actually, most of my writes will be idempotent, so I guess that means I'll get the exactly-once guarantee using the Hadoop output format!
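Just to check I've understood the idempotence point, this is the shape of write I have in mind: an overwriting PutItem keyed on a deterministic id, so applying it twice leaves the item unchanged. A rough sketch (the table and attribute names are just illustrative):

import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;

public class IdempotentWrite {
    // PutItem with a fixed primary key overwrites the existing item, so
    // replaying the same record after a recovery does not change the result.
    static void writeEvent(AmazonDynamoDB client, String eventId, String payload) {
        Map<String, AttributeValue> item = new HashMap<>();
        item.put("eventId", new AttributeValue(eventId)); // hash key: same record, same key
        item.put("payload", new AttributeValue(payload));
        client.putItem(new PutItemRequest("events", item)); // "events" is a made-up table name
    }
}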
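And for the wiring, as Nick suggested below, I'm planning to go through Flink's Hadoop compatibility wrapper, roughly like this (TextOutputFormat is just a stand-in until I've picked a DynamoDB OutputFormat implementation):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class HadoopSinkJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Tuple2<Text, Text>> data = env.fromElements(
                Tuple2.of(new Text("id-1"), new Text("payload-1")));

        Job job = Job.getInstance();
        TextOutputFormat.setOutputPath(job, new Path("/tmp/flink-out"));

        // Wrap the Hadoop OutputFormat so Flink can use it as a sink; a
        // DynamoDB OutputFormat would slot in here in place of TextOutputFormat.
        data.output(new HadoopOutputFormat<>(new TextOutputFormat<Text, Text>(), job));
        env.execute("hadoop-output-format-sink");
    }
}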
Thanks,
Josh

> On 12 Mar 2016, at 09:14, Fabian Hueske <fhue...@gmail.com> wrote:
>
> Hi Josh,
>
> Flink can guarantee exactly-once processing within its data flow, given
> that the data sources allow replaying data from a specific position in the
> stream. For example, Flink's Kafka Consumer supports exactly-once.
>
> Flink achieves exactly-once processing by resetting operators to a
> consistent state and replaying data. This means that data might actually
> be processed more than once, but the operator state will reflect
> exactly-once semantics because it was reset. Ensuring exactly-once
> end-to-end is difficult, because Flink does not control (and cannot reset)
> the state of the sinks. By default, data can be sent more than once to a
> sink, resulting in at-least-once semantics at the sink.
>
> This issue can be addressed if the sink provides transactional writes
> (previous writes can be undone) or if the writes are idempotent (applying
> them several times does not change the result). Transactional support
> would need to be integrated with Flink's SinkFunction; this is not the
> case for Hadoop OutputFormats. I am not familiar with the details of
> DynamoDB, but you would need to implement a SinkFunction with
> transactional support or use idempotent writes if you want to achieve
> exactly-once results.
>
> Best,
> Fabian
>
> 2016-03-12 9:57 GMT+01:00 Josh <jof...@gmail.com>:
>> Thanks Nick, that sounds good. I would still like to understand what
>> determines the processing guarantee, though. Say I use a DynamoDB Hadoop
>> OutputFormat with Flink: how do I know what guarantee I have? And if it's
>> at-least-once, is there a way to adapt it to achieve exactly-once?
>>
>> Thanks,
>> Josh
>>
>>> On 12 Mar 2016, at 02:46, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>
>>> Pretty much anything you can write to from a Hadoop MapReduce program
>>> can be a Flink destination. Just plug in the OutputFormat and go.
>>>
>>> Re: output semantics, your mileage may vary. Flink should do you fine
>>> for at-least-once.
>>>
>>>> On Friday, March 11, 2016, Josh <jof...@gmail.com> wrote:
>>>> Hi all,
>>>>
>>>> I want to use an external data store (DynamoDB) as a sink with Flink.
>>>> It looks like there's no connector for Dynamo at the moment, so I have
>>>> two questions:
>>>>
>>>> 1. Is it easy to write my own sink for Flink, and are there any docs
>>>> around how to do this?
>>>> 2. If I do this, will I still have Flink's processing guarantees? I.e.,
>>>> can I be sure that every tuple has contributed to the DynamoDB state
>>>> either at-least-once or exactly-once?
>>>>
>>>> Thanks for any advice,
>>>> Josh
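P.S. For anyone who finds this thread later: the custom-sink route Fabian mentions would be a SinkFunction along these lines. This is only a rough, untested sketch: the table name, attribute names, and the Tuple2<String, String> record type are placeholders, and client/credentials configuration is elided.

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;

public class DynamoDBSink extends RichSinkFunction<Tuple2<String, String>> {
    private transient AmazonDynamoDB client;

    @Override
    public void open(Configuration parameters) {
        client = new AmazonDynamoDBClient(); // region/credentials configuration elided
    }

    @Override
    public void invoke(Tuple2<String, String> record) {
        // After a failure Flink may replay records, so this can run more than
        // once per record; keying the PutItem on a deterministic id (f0) makes
        // the duplicate write overwrite the same item, keeping replays harmless.
        Map<String, AttributeValue> item = new HashMap<>();
        item.put("eventId", new AttributeValue(record.f0));
        item.put("payload", new AttributeValue(record.f1));
        client.putItem(new PutItemRequest("events", item));
    }
}

It would be attached with stream.addSink(new DynamoDBSink()), assuming a DataStream<Tuple2<String, String>> of (id, payload) pairs.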