Hi Fabian,
Thanks, that's very helpful. Actually, most of my writes will be idempotent, so
I guess that means I'll get the exactly-once guarantee using the Hadoop output
format!
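To sanity-check my understanding, here's a toy sketch (the table below is just a HashMap standing in for DynamoDB, not the real client, and the record stream is made up): because each write is a keyed, deterministic upsert, replaying the stream after a failure leaves the table in the same state as processing it once.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a DynamoDB table: last write per primary key wins. */
class FakeDynamoTable {
    final Map<String, String> items = new HashMap<>();

    /** Idempotent upsert: applying it N times equals applying it once. */
    void put(String key, String value) {
        items.put(key, value);
    }
}

public class IdempotentReplayDemo {
    public static void main(String[] args) {
        List<String[]> records = List.of(
                new String[]{"user-1", "clicked"},
                new String[]{"user-2", "scrolled"});

        // Process the stream once, with no failures.
        FakeDynamoTable once = new FakeDynamoTable();
        for (String[] r : records) once.put(r[0], r[1]);

        // Simulate failure and recovery: the same records are replayed twice.
        FakeDynamoTable replayed = new FakeDynamoTable();
        for (String[] r : records) replayed.put(r[0], r[1]);
        for (String[] r : records) replayed.put(r[0], r[1]);

        // At-least-once delivery + idempotent writes => exactly-once results.
        System.out.println(once.items.equals(replayed.items)); // prints "true"
    }
}
```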
Thanks,
Josh
> On 12 Mar 2016, at 09:14, Fabian Hueske wrote:
>
> Hi Josh,
>
> Flink can guarantee exactly-once processing within its data flow, given that
> the data sources allow replaying data from a specific position in the stream.
> For example, Flink's Kafka Consumer supports exactly-once.
>
> Flink achieves exactly-once processing by resetting operator state to a
> consistent state and replaying data. This means that data might actually be
> processed more than once, but the operator state will reflect exactly-once
> semantics because it was reset. Ensuring exactly-once end-to-end is
> difficult, because Flink does not control (and cannot reset) the state of the
> sinks. By default, data can be sent more than once to a sink, resulting in
> at-least-once semantics at the sink.
>
> This issue can be addressed if the sink provides transactional writes
> (previous writes can be undone) or if the writes are idempotent (applying
> them several times does not change the result). Transactional support would
> need to be integrated with Flink's SinkFunction. This is not the case for
> Hadoop OutputFormats. I am not familiar with the details of DynamoDB, but you
> would need to implement a SinkFunction with transactional support or use
> idempotent writes if you want to achieve exactly-once results.
>
> Best, Fabian
>
> 2016-03-12 9:57 GMT+01:00 Josh:
>> Thanks Nick, that sounds good. I would still like to have an understanding
>> of what determines the processing guarantee though. Say I use a DynamoDB
>> Hadoop OutputFormat with Flink, how do I know what guarantee I have? And if
>> it's at-least-once, is there a way to adapt it to achieve exactly-once?
>>
>> Thanks,
>> Josh
>>
>>> On 12 Mar 2016, at 02:46, Nick Dimiduk wrote:
>>>
>>> Pretty much anything you can write to from a Hadoop MapReduce program can
>>> be a Flink destination. Just plug in the OutputFormat and go.
>>>
>>> Re: output semantics, your mileage may vary. Flink should do you fine for
>>> at least once.
>>>
On Friday, March 11, 2016, Josh wrote:
Hi all,
I want to use an external data store (DynamoDB) as a sink with Flink. It
looks like there's no connector for Dynamo at the moment, so I have two
questions:
1. Is it easy to write my own sink for Flink and are there any docs around
how to do this?
2. If I do this, will I still be able to have Flink's processing
guarantees? I.e., can I be sure that every tuple has contributed to the
DynamoDB state either at-least-once or exactly-once?
Thanks for any advice,
Josh
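In case it helps frame question 2, here's roughly the shape I had in mind. The SinkFunction interface below is a hand-written stand-in mirroring the shape of org.apache.flink.streaming.api.functions.sink.SinkFunction (Flink isn't on the classpath here), and FakeDynamoClient is a hypothetical placeholder for the real AWS client, so this is only a sketch of the idea, not working connector code.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-in mirroring Flink's SinkFunction interface (one call per record). */
interface SinkFunction<IN> {
    void invoke(IN value) throws Exception;
}

/** Placeholder for a real DynamoDB client; putItem is a keyed upsert. */
class FakeDynamoClient {
    final Map<String, String> table = new HashMap<>();

    void putItem(String key, String value) {
        table.put(key, value);
    }
}

/**
 * A custom sink that writes each {primaryKey, value} record with an
 * idempotent put. Records replayed after a failure overwrite with the same
 * value, so the final table reflects each record exactly once.
 */
class DynamoDbSink implements SinkFunction<String[]> {
    private final FakeDynamoClient client;

    DynamoDbSink(FakeDynamoClient client) {
        this.client = client;
    }

    @Override
    public void invoke(String[] record) {
        client.putItem(record[0], record[1]); // record = {primaryKey, value}
    }
}
```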