Many thanks for your quick reply.

1)      My implementation has no commits. All commits are done in 
FlinkKafkaProducer class I envisage.



KeyedSerializationSchemaWrapper keyedSerializationSchemaWrapper = new 
KeyedSerializationSchemaWrapper(new SimpleStringSchema());

new FlinkKafkaProducer<String>("test.out", 
keyedSerializationSchemaWrapper,KafkaProperities.getProperties(env), 
FlinkKafkaProducer.Semantic.EXACTLY_ONCE);



If the latency could be as long as the interval of checkpoint, it would be not 
ideal for a long interval setting e.g. a few minutes



2)      My parallelism is set on the job level, I would expect they all have 
the same parallelism for each source, operator and sink. Actually, my test code 
only has one kafka source, one map and one kafka sink. It has produced 
duplication in a restart if I use the at least once mode.

Regards,

Min

From: Guowei Ma [mailto:guowei....@gmail.com]
Sent: Sonntag, 7. April 2019 08:42
To: Tan, Min
Cc: user@flink.apache.org
Subject: [External] Re: Lantency caised Flink Checkpoint on EXACTLY_ONCE mode

If your implementation only commits your changing after the complete of a 
checkpoint I think the latency of e2e is at least the interval of checkpoint.

I think the document wants to say that a topology, which only has 
flatmap/filter/map(no  task has more than one input) could achieve the exactly 
once semantics even in at least mode since the effect of barrier alignments in 
at least mode is same as in exactly once mode by coincidence for such topology.

I think there might be some benefits if you could set the parallelism of 
source/sink/flatmap to the same parallelism(there could exist other way) in 
some situation since during the alignments the task, which has many inputs 
would not deal with the elements behind the barrier in exactly mode until the 
barriers of all inputs  arrive.  (If your checkpoint interval is very very long 
I think there would be no difference).


Best
Guowei

<min....@ubs.com<mailto:min....@ubs.com>>于2019年4月7日 周日上午3:14写道:
Hi,

I have a simple data pipeline of a Kafka source, a flink map operator and  a 
Kafka sink.

I have a quick question about latency caused by the checkpoint on the exactly 
once mode.

Due to the changes are committed and visible on a checkpoint completion, so the 
latency could be as long as that length of checkpoint interval e.g. 5seconds?

Is my understanding correct?

If I use the at least mode, there will be this addition on latency.  More 
interestingly, the flink document 
https://ci.apache.org/projects/flink/flink-docs-release-1.7/internals/stream_checkpointing.html
 indicate that "dataflows with only embarrassingly parallel streaming 
operations (map(), flatMap(), filter(), …) actually give exactly once 
guarantees even in at least once mode."

Unfortunately, I have been not able to achieve the exactly once with the at 
least once. Do I need more settings than I have with the exactly once mode?

Many thanks for the advises in advance.

Min




--
Best,
Guowei
E-mails can involve SUBSTANTIAL RISKS, e.g. lack of confidentiality, potential 
manipulation of contents and/or sender's address, incorrect recipient 
(misdirection), viruses etc. Based on previous e-mail correspondence with you 
and/or an agreement reached with you, UBS considers itself authorized to 
contact you via e-mail. UBS assumes no responsibility for any loss or damage 
resulting from the use of e-mails. 
The recipient is aware of and accepts the inherent risks of using e-mails, in 
particular the risk that the banking relationship and confidential information 
relating thereto are disclosed to third parties.
UBS reserves the right to retain and monitor all messages. Messages are 
protected and accessed only in legally justified cases.
For information on how UBS uses and discloses personal data, how long we retain 
it, how we keep it secure and your data protection rights, please see our 
Privacy Notice http://www.ubs.com/privacy-statement

Reply via email to