Hi Hequn,
Thanks a lot for your response. I have a few questions about this
topic. Would you please help me about it?
1. I have heard a idempotent way but I do not know how to implement it,
would you please enlighten me about it by a example?
2. If dirty data are added but not updated, then only overwrite is not
enough I think.
3. If using two-phase commit, the sink must support transaction.
3.1 If the sink does not support transaction, for example,
elasticsearch, do I have to use idempotent to implement exactly-once?
3.2 If the sink support transaction, for example, mysql,
idempotent and two-phase commit is both OK. But like you say, if there are a
lot of items between checkpoints, the batch insert is a heavy action, I still
have to use idempotent way to implement exactly-once.
Best
Hequn
> 在 2018年10月13日,上午11:43,Hequn Cheng <[email protected]> 写道:
>
> Hi Henry,
>
> Yes, exactly once using atomic way is heavy for mysql. However, you don't
> have to buffer data if you choose option 2. You can simply overwrite old
> records with new ones if result data is idempotent and this way can also
> achieve exactly once.
> There is a document about End-to-End Exactly-Once Processing in Apache
> Flink[1], which may be helpful for you.
>
> Best, Hequn
>
> [1]
> https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
>
> <https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html>
>
>
>
> On Fri, Oct 12, 2018 at 5:21 PM 徐涛 <[email protected]
> <mailto:[email protected]>> wrote:
> Hi
> I am reading the book “Introduction to Apache Flink”, and in the book
> there mentions two ways to achieve sink exactly once:
> 1. The first way is to buffer all output at the sink and commit this
> atomically when the sink receives a checkpoint record.
> 2. The second way is to eagerly write data to the output, keeping in
> mind that some of this data might be “dirty” and replayed after a failure. If
> there is a failure, then we need to roll back the output, thus overwriting
> the dirty data and effectively deleting dirty data that has already been
> written to the output.
>
> I read the code of Elasticsearch sink, and find there is a
> flushOnCheckpoint option, if set to true, the change will accumulate until
> checkpoint is made. I guess it will guarantee at-least-once delivery, because
> although it use batch flush, but the flush is not a atomic action, so it can
> not guarantee exactly-once delivery.
>
> My question is :
> 1. As many sinks do not support transaction, at this case I have to
> choose 2 to achieve exactly once. At this case, I have to buffer all the
> records between checkpoints and delete them, it is a bit heavy action.
> 2. I guess mysql sink should support exactly once delivery, because
> it supports transaction, but at this case I have to execute batch according
> to the number of actions between checkpoints but not a specific number, 100
> for example. When there are a lot of items between checkpoints, it is a heavy
> action either.
>
> Best
> Henry