Hi Fabian,

Thanks, that's very helpful. Actually, most of my writes will be idempotent, so I guess that means I'll get the exactly-once guarantee using the Hadoop output format!
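Just to check I've understood the idempotence point, this is the shape of write I have in mind: an overwriting PutItem keyed on a deterministic id, so applying it twice leaves the item unchanged. A rough sketch (the table and attribute names are just illustrative):

import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;

public class IdempotentWrite {
    // PutItem with a fixed primary key overwrites the existing item, so
    // replaying the same record after a recovery does not change the result.
    static void writeEvent(AmazonDynamoDB client, String eventId, String payload) {
        Map<String, AttributeValue> item = new HashMap<>();
        item.put("eventId", new AttributeValue(eventId)); // hash key: same record, same key
        item.put("payload", new AttributeValue(payload));
        client.putItem(new PutItemRequest("events", item)); // "events" is a made-up table name
    }
}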
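And for the wiring, as Nick suggested below, I'm planning to go through Flink's Hadoop compatibility wrapper, roughly like this (TextOutputFormat is just a stand-in until I've picked a DynamoDB OutputFormat implementation):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class HadoopSinkJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Tuple2<Text, Text>> data = env.fromElements(
                Tuple2.of(new Text("id-1"), new Text("payload-1")));

        Job job = Job.getInstance();
        TextOutputFormat.setOutputPath(job, new Path("/tmp/flink-out"));

        // Wrap the Hadoop OutputFormat so Flink can use it as a sink; a
        // DynamoDB OutputFormat would slot in here in place of TextOutputFormat.
        data.output(new HadoopOutputFormat<>(new TextOutputFormat<Text, Text>(), job));
        env.execute("hadoop-output-format-sink");
    }
}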
Thanks,
Josh

> On 12 Mar 2016, at 09:14, Fabian Hueske <fhue...@gmail.com> wrote:
>
> Hi Josh,
>
> Flink can guarantee exactly-once processing within its data flow, given
> that the data sources allow replaying data from a specific position in the
> stream. For example, Flink's Kafka Consumer supports exactly-once.
>
> Flink achieves exactly-once processing by resetting operators to a
> consistent state and replaying data. This means that data might actually
> be processed more than once, but the operator state will reflect
> exactly-once semantics because it was reset. Ensuring exactly-once
> end-to-end is difficult, because Flink does not control (and cannot reset)
> the state of the sinks. By default, data can be sent more than once to a
> sink, resulting in at-least-once semantics at the sink.
>
> This issue can be addressed if the sink provides transactional writes
> (previous writes can be undone) or if the writes are idempotent (applying
> them several times does not change the result). Transactional support
> would need to be integrated with Flink's SinkFunction; this is not the
> case for Hadoop OutputFormats. I am not familiar with the details of
> DynamoDB, but you would need to implement a SinkFunction with
> transactional support or use idempotent writes if you want to achieve
> exactly-once results.
>
> Best,
> Fabian
>
> 2016-03-12 9:57 GMT+01:00 Josh <jof...@gmail.com>:
>> Thanks Nick, that sounds good. I would still like to understand what
>> determines the processing guarantee, though. Say I use a DynamoDB Hadoop
>> OutputFormat with Flink: how do I know what guarantee I have? And if it's
>> at-least-once, is there a way to adapt it to achieve exactly-once?
>>
>> Thanks,
>> Josh
>>
>>> On 12 Mar 2016, at 02:46, Nick Dimiduk <ndimi...@gmail.com> wrote:
>>>
>>> Pretty much anything you can write to from a Hadoop MapReduce program
>>> can be a Flink destination. Just plug in the OutputFormat and go.
>>>
>>> Re: output semantics, your mileage may vary. Flink should do you fine
>>> for at-least-once.
>>>
>>>> On Friday, March 11, 2016, Josh <jof...@gmail.com> wrote:
>>>> Hi all,
>>>>
>>>> I want to use an external data store (DynamoDB) as a sink with Flink.
>>>> It looks like there's no connector for Dynamo at the moment, so I have
>>>> two questions:
>>>>
>>>> 1. Is it easy to write my own sink for Flink, and are there any docs
>>>> around how to do this?
>>>> 2. If I do this, will I still have Flink's processing guarantees? I.e.,
>>>> can I be sure that every tuple has contributed to the DynamoDB state
>>>> either at-least-once or exactly-once?
>>>>
>>>> Thanks for any advice,
>>>> Josh
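P.S. For anyone who finds this thread later: the custom-sink route Fabian mentions would be a SinkFunction along these lines. This is only a rough, untested sketch: the table name, attribute names, and the Tuple2<String, String> record type are placeholders, and client/credentials configuration is elided.

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;

public class DynamoDBSink extends RichSinkFunction<Tuple2<String, String>> {
    private transient AmazonDynamoDB client;

    @Override
    public void open(Configuration parameters) {
        client = new AmazonDynamoDBClient(); // region/credentials configuration elided
    }

    @Override
    public void invoke(Tuple2<String, String> record) {
        // After a failure Flink may replay records, so this can run more than
        // once per record; keying the PutItem on a deterministic id (f0) makes
        // the duplicate write overwrite the same item, keeping replays harmless.
        Map<String, AttributeValue> item = new HashMap<>();
        item.put("eventId", new AttributeValue(record.f0));
        item.put("payload", new AttributeValue(record.f1));
        client.putItem(new PutItemRequest("events", item));
    }
}

It would be attached with stream.addSink(new DynamoDBSink()), assuming a DataStream<Tuple2<String, String>> of (id, payload) pairs.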