Hi,

You’re correct that the FlinkKafkaProducer may emit duplicates to Kafka topics, 
as it currently only provides at-least-once guarantees.
Note that this isn’t a restriction only in the FlinkKafkaProducer, but a 
general restriction for Kafka's message delivery.
This can definitely be improved to exactly-once (no duplicates produced into 
topics) once Kafka supports transactional messaging.

On the consumer side, the FlinkKafkaConsumer doesn’t have built-in support to 
dedupe the messages read from topics.
On the other hand this isn’t really feasible, as consumers could basically only 
view messages with different offsets as separate independent messages, unless 
identified by some user application-level logic.
So in the end, we’ll need to rely on the assumption that messages produced into 
Kafka topics are not duplicated, which as explained above, will hopefully be 
available in the near future.

Cheers,
Gordon

On January 14, 2017 at 6:12:29 PM, ljwagerfield (lawre...@dmz.wagerfield.com) 
wrote:

As I understand it, the Flink Kafka Producer may emit duplicates to Kafka  
topics.  

How can I deduplicate these messages when reading them back with Flink (via  
the Flink Kafka Consumer)?  

For example, is there any out-the-box support for deduplicating messages,  
i.e. by incorporating something like "idempotent producers" as proposed by  
Jay Krepps (which, as I understand it, involves maintaining a "high  
watermark" on a message-by-message level)?  



--  
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Deduplicate-messages-from-Kafka-topic-tp11051.html
  
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.  

Reply via email to