Re: External DB as sink - with processing guarantees

2016-03-12 Thread Josh
Hi Fabian,

Thanks, that's very helpful. Actually, most of my writes will be idempotent, so
I guess that means I'll get the exactly-once guarantee using the Hadoop output
format!
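
For what it's worth, here is a rough sketch of what the same idempotent write
would look like in a custom sink, in case the Hadoop route falls through. This
is a sketch only, assuming the AWS Java SDK v1; the table name "results", the
key attribute "id" and the MyEvent type with its getters are made-up names:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
    import com.amazonaws.services.dynamodbv2.model.AttributeValue;

    public class IdempotentDynamoSink extends RichSinkFunction<MyEvent> {

        private transient AmazonDynamoDBClient client;

        @Override
        public void open(Configuration parameters) {
            // Credentials and region are picked up from the environment.
            client = new AmazonDynamoDBClient();
        }

        @Override
        public void invoke(MyEvent event) {
            Map<String, AttributeValue> item = new HashMap<>();
            // Deterministic key: a replayed event overwrites the same item,
            // so at-least-once delivery still yields an exactly-once result.
            item.put("id", new AttributeValue(event.getId()));
            item.put("value", new AttributeValue(event.getValue()));
            client.putItem("results", item);
        }

        @Override
        public void close() {
            if (client != null) {
                client.shutdown();
            }
        }
    }

The key point either way is that the primary key is derived deterministically
from the event, so replays overwrite rather than duplicate.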

Thanks,
Josh

> On 12 Mar 2016, at 09:14, Fabian Hueske  wrote:
> 
> Hi Josh,
> 
> Flink can guarantee exactly-once processing within its data flow, given that 
> the data sources allow data to be replayed from a specific position in the stream. 
> For example, Flink's Kafka Consumer supports exactly-once.
> 
> Flink achieves exactly-once processing by resetting operator state to a 
> consistent state and replaying data. This means that data might actually be 
> processed more than once, but the operator state will reflect exactly-once 
> semantics because it was reset. Ensuring exactly-once end-to-end is 
> difficult, because Flink does not control (and cannot reset) the state of the 
> sinks. By default, data can be sent more than once to a sink resulting in 
> at-least-once semantics at the sink.
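
(Concretely, the reset-and-replay behaviour Fabian describes is switched on by
enabling checkpointing on the execution environment; a minimal sketch, noting
that exactly-once is Flink's default checkpointing mode:)

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
    // Checkpoint every 5 seconds; after a failure, operator state is restored
    // from the latest completed checkpoint and the sources replay from there.
    env.enableCheckpointing(5000);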
> 
> This issue can be addressed if the sink provides transactional writes 
> (previous writes can be undone) or if the writes are idempotent (applying 
> them several times does not change the result). Transactional support would 
> need to be integrated with Flink's SinkFunction. This is not the case for 
> Hadoop OutputFormats. I am not familiar with the details of DynamoDB, but you 
> would need to implement a SinkFunction with transactional support or use 
> idempotent writes if you want to achieve exactly-once results.
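
(A rough sketch of the transactional route: buffer records in the sink and make
them visible only once the checkpoint covering them completes. MyEvent and
flushToDatabase() are placeholders, not real APIs:)

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.flink.runtime.state.CheckpointListener;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

    public class TransactionalSink extends RichSinkFunction<MyEvent>
            implements CheckpointListener {

        private final List<MyEvent> buffer = new ArrayList<>();

        @Override
        public void invoke(MyEvent event) {
            // Hold records back until their checkpoint is confirmed.
            buffer.add(event);
        }

        @Override
        public void notifyCheckpointComplete(long checkpointId) {
            // Only now is it safe to make the writes visible. A complete
            // implementation would also snapshot the buffer as operator state
            // (e.g. via the Checkpointed interface) so it survives a restore.
            flushToDatabase(buffer);
            buffer.clear();
        }

        private void flushToDatabase(List<MyEvent> events) {
            // Placeholder: commit the batch in a single transaction.
        }
    }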
> 
> Best, Fabian
> 
> 2016-03-12 9:57 GMT+01:00 Josh :
>> Thanks Nick, that sounds good. I would still like to have an understanding 
>> of what determines the processing guarantee though. Say I use a DynamoDB 
>> Hadoop OutputFormat with Flink, how do I know what guarantee I have? And if 
>> it's at-least-once, is there a way to adapt it to achieve exactly-once?
>> 
>> Thanks,
>> Josh
>> 
>>> On 12 Mar 2016, at 02:46, Nick Dimiduk  wrote:
>>> 
>>> Pretty much anything you can write to from a Hadoop MapReduce program can 
>>> be a Flink destination. Just plug in the OutputFormat and go.
>>> 
>>> Re: output semantics, your mileage may vary. Flink should do you fine for 
>>> at-least-once.
>>> 
>>>> On Friday, March 11, 2016, Josh  wrote:
>>>> Hi all,
>>>>
>>>> I want to use an external data store (DynamoDB) as a sink with Flink. It 
>>>> looks like there's no connector for Dynamo at the moment, so I have two 
>>>> questions:
>>>>
>>>> 1. Is it easy to write my own sink for Flink and are there any docs around 
>>>> how to do this?
>>>> 2. If I do this, will I still be able to have Flink's processing 
>>>> guarantees? I.e. Can I be sure that every tuple has contributed to the 
>>>> DynamoDB state either at-least-once or exactly-once?
>>>>
>>>> Thanks for any advice,
>>>> Josh
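
To make Nick's "plug in the OutputFormat and go" advice concrete, here is a
minimal sketch of wiring a Hadoop mapreduce OutputFormat into a Flink batch
job. TextOutputFormat stands in for whatever OutputFormat the DynamoDB
connector provides (which I haven't checked); Flink also ships a matching
wrapper for the older mapred API:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
    import org.apache.flink.api.java.tuple.Tuple2;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class HadoopSinkExample {

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env =
                    ExecutionEnvironment.getExecutionEnvironment();

            // Stand-in pipeline; records must be Tuple2<K, V> pairs matching
            // the wrapped OutputFormat's key and value types.
            DataSet<Tuple2<Text, Text>> results = env.fromElements(
                    new Tuple2<>(new Text("key"), new Text("value")));

            Job job = Job.getInstance();
            TextOutputFormat.setOutputPath(job, new Path("/tmp/flink-out"));
            results.output(
                    new HadoopOutputFormat<>(new TextOutputFormat<Text, Text>(), job));

            env.execute("Hadoop OutputFormat sink");
        }
    }

Note that the guarantee is then whatever the wrapped OutputFormat gives you; as
discussed above, idempotent writes are what lift it from at-least-once to
exactly-once.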